Foreword


Welcome everyone to LAC 2019 at CCRMA! 

For the second time in its seventeen-year history, the Linux Audio Conference (LAC)
is hosted in the United States of America by the Center for Computer Research in Music
and Acoustics (CCRMA) at Stanford University. With its informal, workshop-like atmosphere,
LAC is a blend of scientific and technical papers, tutorials, sound installations,
and concerts centered on the free GNU/Linux operating system and open-source free software
for audio, multimedia, and musical applications. LAC is a unique platform during
which members of this community gather to exchange ideas, draft new projects, and see
old friends.

In these times of increasing political tensions and of rising extremism throughout the 
world, we believe that emphasizing and promoting the universality of this type of event 
is of the utmost importance. The Linux audio community exists worldwide; we believe it 
should remain a priority to diversify LAC’s geographical location from year to year for the 
benefit of those who can’t afford to travel to the other side of the world. 

This year, a large portion of the presenters and performers come from the Americas
and Asia. LAC-19 features six paper sessions, five concerts, four workshops, and one keynote,
as well as various posters, demos, and side events happening in various locations on the
Stanford University campus.

We wish you a pleasant stay at Stanford and we hope that you will enjoy the conference! 
Romain Michon (LAC-19 Co-Chair) 

CCRMA, Stanford University (USA) & GRAME-CNCM, Lyon (France) 
ABSTRACT

This paper presents the development of a virtual spacecraft simulator
game, where the goal for the player is to navigate their way to
various planetary or stellar objects in the sky with a sonified poi.
The project utilises various open source hardware and software platforms
including Stellarium, Raspberry Pi, HappyBrackets and the
Azul Zulu Java Virtual Machine. The resulting research could be
used as a springboard for developing an interactive science game
to facilitate the understanding of the cosmos for children. We will
describe the challenges related to hardware, software and network
integration and the strategies we employed to overcome them.

1. INTRODUCTION 

HappyBrackets is an open source Java based programming environment
for creative coding of multimedia systems using Internet of
Things (IoT) technologies [1]. Although HappyBrackets has focused
primarily on audio digital signal processing (including synthesis,
sampling, granular sample playback, and a suite of basic effects), we
created a virtual spacecraft game that added the functionality of controlling
a planetarium display through the use of WiFi enabled Raspberry Pis.
The player manoeuvres the spacecraft by manipulating a
sonic poi¹, which is usually played in the manner shown in Figure 1.
The poi contains an inertial measurement unit (IMU), consisting of
an accelerometer and gyroscope, and a single button. The goal of the
game is for a player to choose an astronomical object, for example a
planet or star, and to fly towards that object. This enables the player
to view other objects, including planets, moons, stars and galaxies in
the field of view. For example, Figure 2 shows how the player might
view Saturn from Earth, while Figure 3 shows how the player may
view Saturn from their spacecraft. The sonic poi generates sound
that is indicative of the player’s field of view. Additionally, the poi
provides audible feedback when the player zooms in or out.

1 "Poi spinning is a performance art, related to juggling, where weights on
the ends of short chains are swung to make interesting patterns." [2, p. 173]

Figure 1: The conventional way of playing a sonic poi.



Figure 2: Saturn viewed from the ground from Stellarium. 



Figure 3: A closer view of Saturn from Stellarium. 


The University of New South Wales required a display for their
open day to showcase some of the work conducted in the Interactive
Media Lab. The opportunity to develop an environment in which visitors
could engage with the technology we were developing would
not only help attract prospective students, it was also a
way to develop and test the integration of the various research components
we were working on. Many managers and businesses seek to engage
new customers through gamification [3]; in this case, the prospective
customers were potential students. Furthermore, research indicates
that visualisation and interpretation of software behaviour developed
as part of a game is more memorable, which facilitates locating
errors or developing methods for improvement [4]. Developing
a game, therefore, would not only engage the visitors, it would also
provide us with a more memorable way of seeing how our system was
behaving.

The technology to develop the game required two different versions
of Raspberry Pi, installation of planetarium software onto one
of the Pis, and the creation of a Java API to join the different systems.
This paper details the strategies and techniques to integrate the
different technologies and describes some of the workarounds for
unresolved issues. We also discuss the goals, rules and rewards used
to define the game and the methods we used to entice prospective
players. Finally, we list areas where the research can be extended.

2. BACKGROUND TO RESEARCH 

The research was inspired by a previous project developed by one of
the authors that correlated what a viewer saw in the night sky through
binoculars with data obtained from on-line astronomical data catalogues
[5]. One installation, which was conducted in conjunction
with the Newcastle Astronomical Society on one of their field viewing
nights, was particularly successful [6]. More than twenty members
of the public were enticed into viewing the night sky through
high powered binoculars while sound that was based on data from
the stars they were viewing was playing through loudspeakers on the
field.

Another set of performances was conducted with an improvising
ensemble that featured various astronomical photos displayed as
a slide show [7]. The stellar data was mapped as MIDI and successfully
functioned as inspirational impetus for the performers, but was
unsuccessful from an astronomical point of view. First, the ability
for viewers to look through the equipment was directly dependent
upon the weather. One performance, for example, had a night sky
complete with thick black cloud, heavy rain and lightning. Moreover,
when the weather was favourable for viewing, the audience
were often content to just watch the performers rather than venture
out of their chairs to view through the binoculars [5]. The audience
feedback was that although they really liked the slide show,
many were unaware that the binoculars were even there for viewing.
Instead of providing a slide show at the next performance, an
improvisation using Stellarium from a laptop computer was used on
the screen. The audience’s response was extremely favourable, inspiring
the idea of using Stellarium as a visual stimulus instead of
binoculars.

2.1. Raspberry Pi 

The Raspberry Pi was originally developed in 2011 [8] for education
by the Raspberry Pi Foundation, a UK based educational charity
[9] [10]. The Raspberry Pi has a very large user base and a significant
number of plug-in sensors available for it [11], and supports a
128GB SD card, which can be used to store more than 200 hours of
high-quality audio. The Raspberry Pi Foundation officially supports a
derivative of the Linux distribution Debian known as Raspbian [12].
Raspbian's inclusion of compilers, support for multiple coding languages,
and the ability to run multiple programs provides the flexibility
that enables a system to expand as an interactive platform as
newer technologies become available. The game project used two
different versions of Raspberry Pi and Raspbian. The sonic poi required
a small form factor and low power consumption but did not require
a GUI; consequently, a Pi Zero running Raspbian Stretch
Lite was selected. The device used to display the graphics required
significantly more power but did not have size restrictions, so a Raspberry
Pi B+ running the desktop version of Stretch was selected for
this.


2.2. HappyBrackets 

HappyBrackets commenced as "A Java-Based remote live coding
system for controlling multiple Raspberry Pi units" [13] where a
master controller computer sent pre-compiled Java classes to selected
Raspberry Pi devices on a network. Unlike the Arduino sketch,
which is effectively a single program [14], the HappyBrackets composition
is not a standalone executable program. The HappyBrackets
core has a thread that listens for incoming bytecode classes, and after
receiving the class, executes the new class’s functionality through a
Java interface. This allows for multiple concurrent compositions that
can be easily created or updated during composition or the creative
coding performance [1]. This research was extended with the development
of the Distributed Interactive Audio Device (DIAD) [15],
which contained an IMU consisting of an accelerometer, gyroscope
and compass. The devices were handled by the audience and incorporated
into the environment. The DIADs not only responded to
user manipulation, they also responded to one another. Furthermore,
DIADs were configured to automatically connect to the wireless network,
and once a DIAD came into range of the network, it became a
part of the DIAD multiplicity. The main focus of this development
was the implementation of a reusable platform that allowed creators
to easily develop interactive audio and easily deploy it to other devices.
Although HappyBrackets runs on many embedded platforms,
the main research has been with the Raspberry Pi, primarily due to
the availability and low cost of the devices. HappyBrackets is licensed
under the Apache License 2.0² and is available through
GitHub³.

A prebuilt disk image, which contains the Java Virtual Machine
(JVM), the I2C drivers to enable access to the IMU, and libraries to
access the GPIO, enables users to flash an SD card and start using
HappyBrackets without ever having to connect their device to
the Internet. The licence for the Oracle JVM, however, appeared to
prohibit embedding the Oracle JVM into a prebuilt image and was
therefore legally problematic. We found that the Azul Zulu JVM
was available under the GNU GPLv2 licence⁴, enabling an embedded
distribution within an image. Medromi et al. conducted a study
that compared the two JVMs [16]. Their tests revealed that Zulu
created more threads and classes than Oracle, indicating that Zulu
probably used more memory, making it more susceptible to garbage
collection issues. Furthermore, their tests showed that Zulu also used
a greater percentage of CPU, indicating greater power consumption.
The report, however, did not detail the difference in performance
speed between the two JVMs. Our own initial tests did not show
any difference between the two JVMs and there was no noticeable
performance degradation; however, this is an area we still need to
research. It is possible to change the default JVM used in the Raspberry
Pi from the terminal, which would make switching between
JVMs when performing comparative tests relatively easy.


2 www.apache.org/licenses/ [accessed November 2018] 

3 github.com/orsjb/HappyBrackets [accessed November 2018] 
4 www.gnu.org/licenses/old-licenses/gpl-2.0.txt [accessed November 2018]
2.3. Stellarium

The advancement of computing power over the last two decades
has made planetarium software commonplace on both
desktop computers and mobile devices. Moreover,
many of these software packages (including RedShift⁵, SkySafari⁶,
StarMap⁷, TheSkyX⁸, and Stellarium⁹) have become valuable tools
for astronomers. They facilitate the identification of objects and
the planning of viewing and astrophotography sessions by enabling
sky simulation for any particular location, date and time [17].

Stellarium is an open source software project distributed under
the GNU General Public Licence with the source code available
through GitHub¹⁰. Stellarium functions as a virtual planetarium,
calculating positions of the Sun, Moon, stars and planets based on
the time and location defined by the user. Moreover, the viewing
location does not even need to be on Earth. For example, Figure 4
displays Stellarium rendering Jupiter viewed from its moon Io.




Figure 4: A simulation of Jupiter viewed from Io. 

Stellarium is used by both amateur and professional astronomers,
and is used by the European Organisation for Astronomical Research
in the Southern Hemisphere to facilitate distribution and sharing of
visual data among scientists [18]. Stellarium has a very high quality
graphical display, supporting spherical mirror projection that can be
used with a dome [19]. Stellarium is used in many schools and museums
because it is both scientifically accurate and visually engaging
[18]. Moreover, it is suitable for demonstrating basic through to advanced
astronomy concepts [18]. Stellarium has a built-in library of
600 000 stars, with the ability to add an additional 210 million [19].
Moreover, Stellarium can display constellations from several different
cultures and has labels translated to more than 40 languages,
making Stellarium both culturally aware and inclusive [18].

Although it is quite straightforward to control Stellarium using
a keyboard and mouse, there are many plugins that allow third party
integration with the software. The plugin we were particularly interested
in was Remote Control, which enables
control of Stellarium through HTTP [21]. Stellarium also contains
a powerful scripting engine that enables one to program and
run complete astronomy shows. The scripts, written in JavaScript,
control Stellarium through a series of objects that represent the Stellarium
application components [20].

5 www.redshift-live.com [accessed November 2018]

6 www.southernstars.com [accessed November 2018]

7 www.star-map.fr [accessed November 2018]

8 www.bisque.com [accessed November 2018]

9 stellarium.org [accessed November 2018]

10 github.com/Stellarium/stellarium [accessed November 2018]

3. RELATED WORK 

Video games rose from obscurity in the 1970s into a video arcade
industry grossing $8 billion in 1982 [22, p. 88]. The video
game moved from the arcade into the home with Nintendo and Atari
game consoles [22, 23]. Iconography games like Space Invaders,
Defender, Spaceward HO! and Star Wars were often replaced with
interactive games that became more realistic [23]. Wolf suggests
that there are more than forty different genres of video games [23];
however, we were only particularly interested in the "Training Simulation"
genre.

One study showed that video game expertise developed over
long-term playing had a beneficial effect on the spatial skills in the
player, supporting the hypothesis that "video expertise could function
as informal education for the development of skill in manipulating
two-dimensional representations of three dimensional space"
[22, p. 93]. The aerospace industry has employed training simulators
for many years, with the advancement in virtual reality environments
leading to the availability of a new technology known as "serious
gaming" [24, p. 655]. This technology exploits popular high-quality
computer games, making it available via Software Development Kits
(SDKs) to developers of "serious" [sic] applications such as defence,
surgery, education and aerospace [24, p. 686].

One particularly interesting training simulation project was a
prototype environment for training astronauts in a simulated zero
gravity environment for the purpose of controlling and handling objects
[25]. Ronkko et al. noted that astronauts discovered that using a
laptop in a zero gravity environment was completely different to using
it on Earth, and that the whole concept of a laptop computer in a
zero gravity environment was questionable [25, p. 183].

There have been various implementations of third party integration
with Stellarium. Although it is possible to remotely control a
telescope using Stellarium as the master controller [26], some researchers
have developed projects whereby Stellarium becomes the
slave. Tuveri et al. developed two planetarium control systems
for driving Stellarium on a laptop computer [27]. They extended
the Stellarium code in order to send it application messages before
the Remote Control plugin was available in the standard Stellarium
distribution. One interaction implementation was through a touch
screen, while the other was through a Kinect gesture controller [27].

The Remote Control Stellarium plugin was developed by "Florian
Schaukowitsch in the 2015 campaign of the ESA Summer of
Code in Space programme" [20, p. 110], and was used for a visual
art installation in the MAMUZ museum for pre-history [21].
The installation, STONEHENGE. A Hidden Landscape, consisted
of a single computer driving five projections onto a 15x4m curved
screen. The presentation was automated with a Raspberry Pi that used
a cron job to trigger a script via an HTTP request every twenty-five
minutes. This Remote Control plugin is now a standard part of the
Stellarium installation. This use of both scripting and HTTP control
was the mechanism we employed in our game.

4. DEFINING THE GAMIFIED EXPERIENCE 

One of the intentions of creating the gamified environment was to engage
visitors. In the gamified experience, four parties are involved:
players, designers, spectators, and observers [28]. The key to developing
a successful gamified experience is to identify who the parties
are, each of whom has a different level of involvement
or immersion [28], and how to engage them for the purpose of creating
a positive and memorable experience [3]. Players were the visitors who physically
controlled the virtual spacecraft, and in a sense, were the competitors
and highly immersed in the experience. Spectators were people
who did not directly compete in the game, but instead influenced the
game indirectly by encouraging the player and were also highly immersed
in the experience. Observers were other visitors in the space
who were passively involved and had no direct impact on the game.
They were, however, mildly involved and often moved to become
players or spectators [28].

Research indicates that the three main factors in developing an
enjoyable game are challenge, fantasy and curiosity [29]. We provided
challenge in that we set a goal that had increasing levels of
difficulty. As the player drew closer to the planet, the spacecraft became
more difficult to control.

We utilised fantasy in that we implemented two modes of play:
terrestrial and spaceship. Terrestrial mode allowed the player to use
gravity in a familiar way, provided wide fields of view that showed
large amounts of sky and provided coarse control. Spaceship mode
showed narrower fields of view, displaying significantly less sky, and
provided finer control; however, the player was not allowed to use gravity
in their control. We enabled the player to zoom in and out by performing
a quick twist action of the ball around the string. If the gyroscope
pitch value exceeded the set threshold, the field of view would
change, simulating a zoom in or out. When the user changed their
field of view to less than 30 degrees, the play mode went from terrestrial
to spaceship. We provided audible feedback that sounded
like a zipper when the level of zoom was changed.
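
The zoom gesture and mode switch described above can be sketched in Java as follows. This is a minimal illustration rather than the game's actual code: the class name ZoomController, the twist threshold, the zoom step, the field-of-view limits and the choice that a positive twist zooms in are all assumptions. Only the 30 degree boundary between terrestrial and spaceship mode comes from the description above.

```java
/** Hypothetical sketch of the twist-to-zoom gesture; not the authors' code. */
class ZoomController {
    enum Mode { TERRESTRIAL, SPACESHIP }

    static final double MODE_SWITCH_FOV_DEG = 30.0; // boundary stated in the text
    static final double TWIST_THRESHOLD = 2.0;      // assumed gyroscope threshold
    static final double ZOOM_STEP = 0.8;            // assumed field-of-view scaling per twist
    static final double MIN_FOV_DEG = 1.0;          // assumed lower limit
    static final double MAX_FOV_DEG = 120.0;        // assumed upper limit

    private double fovDegrees = 60.0;               // assumed starting field of view

    /**
     * Feed one gyroscope pitch reading. Returns true when the field of view
     * changed, which is where the zipper sound could be triggered.
     */
    boolean onGyroPitch(double pitch) {
        if (Math.abs(pitch) <= TWIST_THRESHOLD) {
            return false; // ordinary movement, not a zoom twist
        }
        // Assumption: a positive twist zooms in, a negative twist zooms out.
        double factor = pitch > 0 ? ZOOM_STEP : 1.0 / ZOOM_STEP;
        double next = Math.max(MIN_FOV_DEG, Math.min(MAX_FOV_DEG, fovDegrees * factor));
        boolean changed = next != fovDegrees;
        fovDegrees = next;
        return changed;
    }

    /** Spaceship mode applies once the field of view drops below 30 degrees. */
    Mode mode() {
        return fovDegrees < MODE_SWITCH_FOV_DEG ? Mode.SPACESHIP : Mode.TERRESTRIAL;
    }

    double fovDegrees() {
        return fovDegrees;
    }
}
```

With these assumed constants, four successive zoom-in twists from 60 degrees take the field of view below 30 degrees, at which point mode() reports SPACESHIP.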

The only controls available at the time on the poi were the accelerometer
and gyroscope¹¹, while the only feedback was audio generated
by the poi and the Stellarium display. In the same way that a laptop
could not be used conventionally in a zero gravity environment [25],
a player would be unlikely to control the game successfully using
the poi by spinning it around their body [2]. Figure 5 shows the poi
with three axes of accelerometer and gyroscope on the left and right
respectively.



Figure 5: Sonic Poi accelerometer and gyroscope input. 


11 The button control was added to the poi later.

In terrestrial mode, we wanted to simulate a viewer on the ground
lifting and turning their head to view the sky as one would on Earth,
which is essentially increasing the altitude and rotating the azimuth.
The player "lifts their head" by raising the ball of the poi in an arc,
using the point where the player holds the rope as the centre, and
measuring Y axis acceleration through the IMU in the poi. Rotating
the viewer’s head was simulated by detecting the pitch value of the
gyroscope, as shown on the right side of Figure 5. Gyroscope values
only change while the object is rotating, whereas gravitational
accelerometer values are maintained when the object is stationary.
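
The difference between the two sensors can be illustrated with a short Java sketch. It is only a sketch under stated assumptions: the class and method names are hypothetical, the Y axis acceleration is assumed to arrive in units of g, and the gyroscope reading is assumed to be a rate in radians per second. The altitude scaling mirrors the control_val / 2 * Math.PI expression in the changeAltitude listing later in this paper; the azimuth integration is an assumption about how a rate reading could be accumulated.

```java
/** Hypothetical helpers for terrestrial mode; names and units are assumptions. */
class TerrestrialView {
    private static final double TWO_PI = 2.0 * Math.PI;

    /**
     * Map the gravity component on the Y axis (assumed range -1 g to +1 g) to an
     * altitude in radians between -PI/2 and +PI/2. Because gravity is always
     * present, this value holds steady while the poi is held still.
     */
    static double altitudeFromAcceleration(double yAccelG) {
        double clamped = Math.max(-1.0, Math.min(1.0, yAccelG));
        return clamped / 2 * Math.PI;
    }

    /**
     * Accumulate a gyroscope rate into an azimuth. The gyroscope reads zero when
     * the poi is not rotating, so the azimuth only changes during rotation.
     * The result is wrapped into the range 0 (inclusive) to 2*PI (exclusive).
     */
    static double advanceAzimuth(double azimuthRad, double rateRadPerSec, double dtSec) {
        double next = azimuthRad + rateRadPerSec * dtSec;
        return ((next % TWO_PI) + TWO_PI) % TWO_PI;
    }
}
```

For example, holding the ball level (0 g on the Y axis) gives an altitude of zero, while a reading of 1 g gives PI/2, which corresponds to looking straight up.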

In the spaceship mode, we wanted to simulate the player navigating
through space in a zero-gravity environment. The yaw and
the pitch were used as input, whereby the user had to roll the ball
in their hands to move the display. This was completely foreign to
users at first because there was no haptic feedback, nor any sense
of grounding for the user or the control. In a sense, it was similar
to balancing on a ball in space because you could not fall off; you
would just float in an unintentional direction. Furthermore, it was
not easy to detect which axis was which because the poi was a ball
shape. Moreover, rotating one axis would affect the cognition of
the other axis. Consider a player in Figure 5 rotating the ball forward
around the X axis with the poi producing a positive yaw. If
the player then turned the poi 180 degrees around the string, rotating
the ball forward again would now produce a negative yaw, which
would mean the screen would start moving in the opposite direction
to what they experienced a moment earlier. The result was that controlling
the display required constant mental adjustment, which we
suggested might simulate to some degree the sense of strangeness an
astronaut may feel controlling objects in outer space [25, p. 183].

In order to run an attractive and engaging display that would trigger
the visitors’ curiosity when they entered the room, we ran Stellarium
scripts that functioned as standalone astronomy shows. We
invited visitors to manipulate the poi and watch the display move
while the script was running. When we saw they were interested
and enjoyed the novelty of interacting with the display through the
poi, we offered them the opportunity to start from Earth and navigate
to one of the planets in our solar system. As they zoomed in
closer to Saturn, they became quite excited when they saw the rings
and realised that they could also see Saturn's moons. For those who
were particularly enthusiastic, we suggested finding Jupiter next, informing
them that they would also be able to see the four Galilean
moons that night at home with a standard pair of binoculars. We also
asked them to imagine that rolling the ball to control their movement
might be as strange as moving about in a zero gravity environment.
Although a few of the players gave up after a few minutes, the majority
of players continued for more than ten minutes, had a lot of
fun, and exhibited a sense of achievement in being able to navigate
into outer space.


5. DEVELOPMENT 

The system was originally developed as a tool for evaluating the performance,
behaviour and suitability of networked control of Stellarium
as part of a potential interactive audio visual artwork. We intended
to calculate an azimuth and altitude position
from the rotation and manipulation of the poi. These values would
be sent via the network to another device running Stellarium,
which would then display the sky based on those values. Additionally,
we sent commands to change the field of view on Stellarium,
which effectively acted as a zoom function. The poi also
played audio as a series of ten uniformly distributed pseudorandom
sine waves between 500 and 550Hz, giving a sense of cosmic microwave
background noise.

// Pick a pseudorandom frequency between 500 and 550 Hz
float freq = hb.rng.nextFloat() * 50 + 500;
// The envelope drives the gain of this oscillator
Envelope envelope = new Envelope(1);
WaveModule soundGenerator = new WaveModule();
soundGenerator.setFequency(freq);
soundGenerator.setGain(envelope);
soundGenerator.connectTo(masterGain);
return envelope;

A metronome iterates through each of the envelopes, adding segments
that cause each frequency to momentarily pop out of the background
as a beep.

hb.createClock(5000).addClockTickListener((offset, this_clock) -> {
    // Select the next envelope in round-robin order
    Envelope e = envelopes.get(envelopIndex++ % TOTAL_OSCILLATORS);
    final float LOUD_VOL = 20;
    final float LOUD_SLOPE = 20;
    final float LOUD_DURATION = 200;

    // Ramp up to the loud level, hold it, then ramp back down to 1
    e.addSegment(LOUD_VOL, LOUD_SLOPE);
    e.addSegment(LOUD_VOL, LOUD_DURATION);
    e.addSegment(1, LOUD_SLOPE);
});

As the user zooms in, the metronome becomes faster, increasing 
the beep rate, generating a sense of sonic tension. 

5.0.1. Starting Stellarium 

The first challenge was starting Stellarium on the Pi from within
HappyBrackets. HappyBrackets has a simple facility to execute shell
commands or create processes through both the Java Runtime exec
and the ProcessBuilder [30]. We attempted a script to run Stellarium
from a process command, which ran successfully when executed
from a terminal; however, we could not get HappyBrackets to run
the script after each fresh reboot of the device because the program was unable
to access the display. Interestingly, if we killed the JVM and then
started HappyBrackets again from a terminal, then Stellarium started
from within HappyBrackets with no problem. The problem was that
the HappyBrackets installation script had configured the Raspberry
Pi to automatically start the Java application when the device first
boots by executing a script in /etc/rc.local as defined in the Raspberry
Pi documentation¹². In order to run GUI programs from Java, the
Java program needs to be started when the desktop starts, which was
effected by moving the script command to
~/.config/lxsession/LXDE-pi/autostart¹³. The HappyBrackets installation scripts were consequently
modified to detect whether a desktop version was used, and
to add the HappyBrackets start-up script command accordingly.

5.0.2. Controlling Stellarium 

Examples of controlling Stellarium through the Remote Control API
were provided on the plugin developer page¹⁴, which made use of
the cURL [sic]¹⁵ command line utility¹⁶ and were executed via an SSH
terminal connection to the Pi. Although we did not intend to use curl
in our actual program because Java has its own networking interface,
curl was extremely useful for examining and diagnosing the API through the
terminal.

12 www.raspberrypi.org/documentation/linux/usage/rc-local.md [accessed November 2018]

13 www.raspberrypi.org/forums/viewtopic.php?t=139224 [accessed November 2018]

14 stellarium.org/doc/head/remoteControlApi.html

15 curl.haxx.se

16 cURL should not be confused with the curl programming language.
ec.haxx.se/curl-name.html [accessed November 2018].

Querying the state of Stellarium was performed by issuing
a curl GET command. For example, executing the following
command in the SSH terminal retrieves the current view direction of
Stellarium as a JSON encoded string.

curl -G http://localhost:8090/api/main/view
{"altAz":"[0.954175, 9.54175e-06, 0.299249]","j2000":"[0.240925, 0.147495, -0.959271]","jNow":"[0.241334, 0.148053, -0.959082]"}

Setting the position of Stellarium is executed with the curl POST
command, with the parameters supplied as POST data through the -d option. Executing
the following command would set the display to horizontal by
setting the altitude to zero.

curl -d 'alt=0' http://localhost:8090/api/main/view
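
A Java equivalent of the curl POST above might look like the following sketch, which uses the standard HttpURLConnection class. It is not the HappyBrackets implementation: the class name StellariumPost and its method names are hypothetical, and the sketch assumes that the Remote Control plugin accepts the same form-encoded body that curl -d sends.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper that mirrors `curl -d 'alt=0' <url>` from Java. */
class StellariumPost {

    /** Build a form-encoded body such as "az=1.5&alt=0.25" from name/value pairs. */
    static String formBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(p.getKey())).append('=').append(encode(p.getValue()));
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    /** Send the parameters as an HTTP POST and return the response status code. */
    static int post(String endpoint, Map<String, String> params) throws IOException {
        byte[] body = formBody(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            // Blocks until the server has answered the request
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling post("http://localhost:8090/api/main/view", params) with a map containing alt=0 would issue the same request as the curl command, provided Stellarium and the Remote Control plugin are running on the same machine.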

Having tested the functionality using curl through the terminal, we
implemented calls using the standard Java URL connections [31].
We sent control messages from the poi via UDP to the slave using
HappyBrackets and then immediately sent the HTTP message on the
slave to Stellarium. We found that although the message arrived from
the poi at the slave within a few milliseconds, the time for the
post message to execute on localhost, be actioned by Stellarium, and then
return was typically between 80 and 120 milliseconds. This produced
cumulative latency when the player continually moved the
poi. The accelerometer and gyroscope typically update every 10ms,
so constantly rotating the device for two seconds would generate
approximately 200 messages. These values would become queued
inside the slave and sequentially executed; at roughly 100ms per
request, the queue would take about twenty seconds to clear. A method was required
that would immediately send the last received position change
when the previous message was complete, but would discard earlier
values that were not yet actioned. We accomplished this through
an independent thread for executing the post command. This thread
is effectively dormant while waiting for an event. When a
message arrives on a different thread, the event is triggered, at which
point the thread wakes and sends the message. We effected this
through the use of Java synchronisation objects. The functionality
that sends the post messages to Stellarium executes in an indefinite
loop, lying dormant through the altAzSynchroniser.wait() call.

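new Thread(() -> {
    while (!exitThread) {
        synchronized (altAzSynchroniser) {
            try {
                // Sleep until another thread calls notify() on this object
                altAzSynchroniser.wait();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        // Send only the most recent values; older ones have been overwritten
        sendAltAz(currentAz, currentAlt);
    }
}).start();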

The thread will wait indefinitely until it is notified through the
altAzSynchroniser object. When a message to change the
altitude arrives from the poi, the class variable currentAlt is set
and the altAzSynchroniser object is notified, which in turn
causes the thread shown above to wake and then call sendAltAz
to send the new azimuth and altitude to the localhost.

public void changeAltitude(double control_val) {
    synchronized (altAzSynchroniser) {
        // Scale the control value to radians (control_val * PI / 2)
        currentAlt = control_val / 2 * Math.PI;
        // Wake the sending thread
        altAzSynchroniser.notify();
    }
}
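
The "send only the latest value" pattern in the two listings above can be written as a small self-contained class. The following is a sketch of the general technique, not the code used in the game: the class name LatestValueSender and the Sink interface are hypothetical, and the pending flag is an addition that does not appear in the listings. In Java, a notify() issued while no thread is waiting is not remembered, so the flag lets the worker notice an update that arrived while it was busy sending, and it also guards against spurious wake-ups.

```java
/** Hypothetical latest-value mailbox: slow sends never queue up behind each other. */
class LatestValueSender {
    /** Stand-in for the slow HTTP call, for example a post to Stellarium. */
    interface Sink {
        void send(double az, double alt);
    }

    private final Object lock = new Object();
    private final Thread worker;
    private double az;
    private double alt;
    private boolean pending;        // true when a value is waiting to be sent
    private boolean running = true;

    LatestValueSender(Sink sink) {
        worker = new Thread(() -> {
            while (true) {
                double sendAz;
                double sendAlt;
                synchronized (lock) {
                    while (!pending && running) {
                        try {
                            lock.wait();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                    if (!pending) {
                        return; // stopped and nothing left to send
                    }
                    sendAz = az;
                    sendAlt = alt;
                    pending = false;
                }
                // The slow call runs outside the lock so updates are never blocked
                sink.send(sendAz, sendAlt);
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Overwrite any value that has not been sent yet and wake the worker. */
    void update(double newAz, double newAlt) {
        synchronized (lock) {
            az = newAz;
            alt = newAlt;
            pending = true;
            lock.notify();
        }
    }

    /** Send any outstanding value, then stop the worker thread. */
    void stop() {
        synchronized (lock) {
            running = false;
            lock.notifyAll();
        }
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If update is called two hundred times while a single send is in progress, only the final pair of values is transmitted once that send returns, which is the behaviour the text describes.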

We found that modifying the azimuth and altitude directly often
produced a jittery display due to the 100ms latency coupled with
discarding of values that were not actioned while waiting for the
sendPostMessage call to return. We reduced this problem significantly
by sending arrow key messages that moved the display left
and right instead of sending an azimuth. This produced a smooth
display rotation when rotating the ball. It was not possible to use
this for the altitude in the terrestrial mode because we were using the
accelerometer value to determine the height. In the spaceship mode,
however, this proved very effective as we were able to just send up,
down, left and right messages based on gyroscope action.

6. FUTURE WORK 

There were several issues that we discovered through running the
game. The first problem was that the Raspberry Pi would often crash
when running the display after a certain period of intense manipulation;
however, we were able to run it for several days if we did not demand
too many rapid changes from Stellarium. We substituted the Pi
with a Mac Mini in order to determine where the problems were. We
found that we were able to reproduce an error in Stellarium on the
Raspberry Pi when running the script double_stars.ssc that comes
with Stellarium; however, the Mac ran with no errors. Inspecting the
kernel journal showed errors indicating an inability to allocate memory
within the GPU¹⁷. The VC4 OpenGL driver required to run Stellarium
is still experimental, and it is probable that this is where the
error lies. Research and development in this area is still required to
make a stable Raspberry Pi installation of Stellarium.

We found that when the player started rotating the ball fast, the 
zoom would activate, requiring them to stay within certain rotation 
rates. We modified the game so changing zoom required the player 
to hold the button down when performing a zoom action. 

Messages are sometimes lost over UDP, which became evident
when a zoom message was sometimes not delivered to the slave.
We have performed some tests comparing different routers and different
Raspberry Pis for packet loss. Additionally, we tested code
in both Java and C++. We discovered that as packet intervals exceeded
10 ms, the percentage of packet loss increased. Interestingly,
we found that there was less packet loss using Java than C++ using
the standard compilers distributed with Raspbian. Furthermore, the
quality of the router had a significant impact. Some routers, although
supporting multicasting, stopped sending multicast messages to devices
after about ten minutes. We intend to perform more tests regarding
the packet loss. However, the real concern is that broadcasting
and multicasting of OSC over UDP is not satisfactory [33].

We found that the Just In Time (JIT) compiler took time to convert
the downloaded Java byte code into machine code [34], producing
a brief stuttering effect when executed for the first time. The
problem became exacerbated when using the Pi Zero with ten oscillators
running simultaneously due to the limited power of the Pi
Zero. Once the JIT compiler had converted the code, subsequent
code changes were not affected. Although only an issue when the
program starts, we need to examine strategies to overcome this.

17 github.com/Stellarium/stellarium/issues/550 [accessed November 2018]

7. CONCLUSIONS

During our research we were able to integrate various open source
programs to create a system where we could develop and evaluate
Stellarium as a controllable display element, create inter-process
and device communication using the HappyBrackets Java environment,
and experiment with the use of the sonic poi as a performance
tool. We used this system to create a gamified environment
where visitors were engaged with our technology, providing
them with a positive and memorable experience. We capitalised on
this opportunity to observe and evaluate how our system was behaving,
which was more memorable to us by virtue of it being part of
a game that was played repeatedly. We leveraged the quality of the
Stellarium display coupled with a wireless control device to create
a game that was challenging, fun, engaging and educational. Moreover,
the technical goal was to be able to control Stellarium during
a performance with HappyBrackets, with an example available at
https://youtu.be/NhXRdd-MNoo

The research obtained from developing this game can be used
as a starting point for the development of an interactive educational
installation. Furthermore, we found a way to expose issues with
the OpenGL driver on the Raspberry Pi, Java JIT, and UDP packet loss
and performance using both Java and C++.

ABSTRACT

TimeWorkers is a programming framework for coding sonification
projects in JavaScript using the Web Audio API. It is being used for
sonification workshops with scientists, doctors, and others to facilitate
ease of use and cross-platform deployment. Only a browser
and text editor are needed. Using Free and Open-source Software
(FOSS) the system can run standalone since No Internet is Required
for Development (NIRD). Workshop participants rapidly master principles
of sonification through examples and are encouraged to bring
their own datasets. All mapping code is contained in a project’s
.html landing page. A single generator function iterates over the
project’s data series and provides a fine-grained interface to time-varying
sound parameters. This loop and its internals are patterned
after similar constructions in the Chuck language used by the author
in earlier sonification tutorials.

1. INTRODUCTION 

Sonification shares much with other kinds of computer music making,
including the wide range of programming tools which can be
used. Sonification also shares in the kinds of decisions found in photography
and soundscape recording. Gathering, selecting, framing
and contrast enhancement are a part of working with material from
the world outside of music. On the other hand, another
key part of creating a sonification, mapping, has affinities with algorithmic
composition. TimeWorkers is a browser-based software
framework described in this paper which, while not limited to sonification,
provides in its initial rollout functional support for decisions
specific to such work.

Specialized programming languages have evolved and continue
to evolve which are custom-designed to express musical relationships,
especially timing and concurrency. I’ve used several over the
course of composing computer music with succeeding generations
of hardware platforms, for example, Pla[1], MIDILisp[2], Common
Music[3] and Chuck[4], all of which are examples of computer music
languages with ways of programmatically expressing organization
of sound in time.

TimeWorkers is written in JavaScript and provides a readily available
computation environment for my sonification workshops. To
give a glimpse of what will be explained later in detail, the name
comes from its use of the Web Worker API[5] for composing musical
layers or voices which unfold in time. The software uses browsers’
existing means for sound generation, in this case the built-in computer
music capabilities of the Web Audio API[6]. The added functionality
provided by TimeWorkers offers ways to compose higher-level
aspects of musical timing and texture.

Stepping back for a moment, it’s worth reflecting on how computers
and music have been mingling their intimate secrets for over
50 years. These two worlds evolve in tandem and where they intersect
they spawn practices that are entirely novel. One of these is
sonification, the practice of turning raw data into sounds and sonic
streams to discover new relationships within the dataset by listening
with a musical ear. This is similar to exploring data visualization
with strategies made for the eye to reveal new insights from data
using graphs or animations. A key advantage with sonification is
sound’s ability to present trends and details simultaneously at multiple
time scales, allowing us to absorb and integrate this information
the same way we listen to music.

Kramer, et al.’s prescient Sonification Report [7] (2010) merits
quoting here at length and will be revisited in the conclusion section.
The paper identified “three major issues in the tool development
area that must be tackled to create appropriate synthesis tools
developed for use by interdisciplinary sonification researchers.” The
TimeWorkers framework addresses some (but not all) of the following
points.

“Portability: Sonification scale places demands on audio hardware,
on signal processing and sound synthesis software, and on
computer operating systems. These demands may be more stringent
than the requirements for consumer multimedia. Researchers dealing
with problems that go beyond the limits of one system should be
able to easily move their sonification data and tools onto a more powerful
system. Thus, tools must be consistent, reliable, and portable
across various computer platforms. Similarly, tools should be capable
of moving flexibly between real-time and nonreal-time sound
production.”

“Flexibility: We need to develop synthesis controls that are specific
and sophisticated enough to shape sounds in ways that take advantage
of new findings from perceptual research on complex sounds
and multimodal displays and that suit the data being sonified. In addition
to flexibility of synthesis techniques, simple controls for altering
the data-to-sound mappings or other aspects of the sonification
design are also necessary. However, there should be simple ‘default’
methods of sonification that allow novices to sonify their data quick
and easily.”

“Integrability: Tools are needed that afford easy connections to
visualization programs, spreadsheets, laboratory equipment, and so
forth. Combined with the need for portability, this requirement suggests
that we need a standardized software layer that is integrated
with data input, sound synthesis, and mapping software and that facilitates
the evaluation of displays from perceptual and human factors
standpoints.”

2. USING THE FRAMEWORK 

Meant to be very hands-on, my 2-hour workshops ask the participants
to bring their own laptop and headphones. I first take them
through a simple example which has been an early “etude” assignment
in my course, “Computer Music Fundamentals”[8], taught at
Stanford’s CCRMA. The goal is to get students to start working with
their own datasets as soon as possible and get them exploring a range
of sonifications through experimentation.

A dataset to play with can be scouted out by searching the web
and copied or exported from a spreadsheet or other format. For
starters, it’s simply a single column of numbers in plain text. The
range of values doesn’t matter because it will be automatically rescaled
when read by the framework’s file input layer. My own development
work, examples and code repository are all Linux-based, but
other operating systems work equally well.
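
The automatic rescaling can be pictured with a minimal sketch. The helper names parseColumn and normalize are hypothetical (engine.js is not reproduced in this paper), and min-max scaling to the range 0.0 to 1.0 is assumed because that is the range the mapping function expects.

```javascript
// Hypothetical helper: read a single column of numbers from plain text,
// one value per line, skipping blank lines.
function parseColumn(text) {
  return text
    .split(/\r?\n/)
    .map(s => s.trim())
    .filter(s => s !== "")
    .map(Number)
}

// Hypothetical helper: min-max rescale a series to the range 0.0 to 1.0.
function normalize(values) {
  const lo = Math.min(...values)
  const hi = Math.max(...values)
  const span = hi - lo
  // A constant series has no range; map every point to 0 to avoid dividing by zero.
  return values.map(v => (span === 0 ? 0 : (v - lo) / span))
}

const series = normalize(parseColumn("2.0\n4.0\n3.0\n"))
console.log(series) // [ 0, 1, 0.5 ]
```

Whatever the units of the original column, the lowest value lands on 0 and the highest on 1, so one mapping can serve any dataset.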

2.1. Basic Sonification How-to

2.1.1. What you’ll need

The browser can be a recent version of Firefox, Chromium, Chrome, 
or Edge. A simple text editor like Gedit is all that’s required for 
developing the code and preparing an ASCII data file. 

2.1.2. Testing the demo 

Open the demo URL https://ccrma.stanford.edu/~cc/sonify
to see a page that looks like Figure 1. There’s a default
time series “tides.dat” that can be played by clicking on the demo
icon (the small globe is a button).


[Figure 1 screenshot: browser page titled “basic sonification example” with a drop area reading “Drag a data file from your desktop here to play it in the browser” and a “tides demo” button labelled “click to play data/tides.dat”.]

Figure 1: An example page with options for playing a default time
series or dragging in a datafile.

Alternatively, a data file can be dragged from the desktop onto
the page to sound it with the same preset sonification parameters.

The demo was created by Chris Hartley, a biologist who participated
in the first workshop (in 2016) at the University of British
Columbia. In it, “You can hear the rising and then falling chirp-chirp-chirp
of the major high tides, which get highest at the new
and full moons, and then the slightly lower trill of two roughly equal
high tides per day, which occurs during the quarter moons.” Hartley’s
sonification plays a year’s worth of tidal data at a fast rate using
a sine tone.

After starting the demo or after loading a data file, the stop and
play buttons on the web page become activated (Figure 2).

2.1.3. Modifying the demo 

To practice modifying the demo, a good first goal is to make the rate 
of running through the data much slower. To accomplish this, we’ll 
make a local copy of the demo, test it and then edit it. 

Go to its repository https://cm-gitlab.stanford.edu/cc/sonify
and download a snapshot. The downloaded .zip file
will have a long name that depends on the version. Extract the contents
of the .zip file and open its index.html file in a browser (use
Firefox because it will allow the demo to run as a local file without
manual intervention).

[Figure 2 screenshot: the “basic sonification example” page after loading the data file number-of-earthquakes-per-year-m.dat.]

Figure 2: Stop and play buttons become activated after starting the
demo or dragging in a datafile.

This will allow you to test the local copy of the landing page in 
a browser and make sure it’s working identically to the version on 
the workshop’s web server. If it’s all good, then the local copy of the 
landing page can be opened in a text editor. Search for the line 
let dur = 0.005 
and assign a new value, for example: 

function* sonify(data) { 
let dur = 0.05 

// duration between data points in seconds 

Save the modification in the text editor and then refresh the browser
page to load the changed file. The example can then be played as before
but the rate will now be 10× slower.

Further modifications are quickly explored with the same workflow
of edit-save-refresh-play. For example, consider the mapping function
map(v)

where, for a given value of v, sound parameters are determined for
pitch and loudness (respectively, kn in MIDI key number units and
db in a decibel range from -100 to 0). These in turn are used to
calculate values which will be applied to the sine tone’s frequency
(Hz) and amplitude (range 0.0 to 1.0):

function map(v) { 

let kn = 60 + v * 40 
let f = mtof(kn) 
let db = -30 + v * 10 
let a = dbtolin(db) 
return {pit: f, amp: a} 

} 

map(v) returns pitch frequency and loudness amplitude in an
object created by an object initializer. Its argument, v, is expected
to lie in the range 0.0 to 1.0. In a hidden step which happens when
the data is loaded, the data series has been automatically normalized
to this range. map(v) is set so that the lowest data value will be
sounded at Middle-C (MIDI key number 60) and the highest will be
3 Octaves and a Major Third above. Intermediate values will be linearly
interpolated across key number values (using fractional quantities,
in other words, not quantized to integer key numbers). Code for
the utility functions mtof and dbtolin, respectively for conversion
from MIDI key number to frequency in Hz and dB loudness to
amplitude, has been borrowed from Hongchan Choi’s Web Audio
API Extension (WAAX) project [9].
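
For readers who want to try the mapping outside the framework, here is a minimal sketch of the two utilities. These are the textbook conversion formulas (equal temperament with key 69 at 440 Hz, and amplitude as 10 to the power dB/20), written out as an assumption; the WAAX versions may differ in detail.

```javascript
// MIDI key number to frequency in Hz (equal temperament, key 69 = A4 = 440 Hz).
function mtof(kn) {
  return 440 * Math.pow(2, (kn - 69) / 12)
}

// Decibels to linear amplitude (0 dB gives 1.0, -20 dB gives 0.1).
function dbtolin(db) {
  return Math.pow(10, db / 20)
}

// The mapping function from the text, unchanged.
function map(v) {
  let kn = 60 + v * 40
  let f = mtof(kn)
  let db = -30 + v * 10
  let a = dbtolin(db)
  return {pit: f, amp: a}
}

console.log(map(0)) // lowest data value: Middle-C, about 261.6 Hz, at -30 dB
console.log(map(1)) // highest data value: key number 100, about 2637 Hz, at -20 dB
```

Key numbers 60 and 100 are 40 semitones apart, which is the 3 octaves (36 semitones) plus a major third (4 semitones) mentioned above.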

The sonify generator function sets a new target pitch when processing
each new data value and starts a glissando (a smooth frequency
ramp) to reach the target pitch in the length of time specified
by the data update period, dur. The ramp is a linear function which
updates the sine tone’s frequency each audio sample. Amplitude is
smoothly modulated in the same way.
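
The shape of such a ramp can be sketched in a few lines. The function linearRamp below is a hypothetical stand-in that evaluates the linear interpolation formula the Web Audio API specifies for linearRampToValueAtTime; in the browser the audio engine performs this per sample, so nothing like it appears in the project code.

```javascript
// Value of a linear ramp from v0 (at time t0) to v1 (at time t1), evaluated at time t.
function linearRamp(v0, v1, t0, t1, t) {
  if (t <= t0) return v0
  if (t >= t1) return v1
  return v0 + (v1 - v0) * ((t - t0) / (t1 - t0))
}

// A glissando from 220 Hz to 440 Hz over one data period of dur = 0.005 s,
// sampled at five points for illustration.
const dur = 0.005
for (let k = 0; k <= 4; k++) {
  const t = (k / 4) * dur
  console.log(t.toFixed(5), linearRamp(220, 440, 0, dur, t).toFixed(1))
}
```

Because each new data point starts its ramp from wherever the previous ramp ended, the resulting pitch contour is continuous.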

The complete sonify generator function for this example is listed
below and includes a definition of the sound source along with a
mechanism for applying updates to its parameters. The new function
Sin(timeWorker) instantiates a SinOsc and several methods
which start the oscillator, apply parameter updates to it and stop
it. After instantiation as a local object, sin, it is initialized with the first
values from the mapping function and a gain of 0. Ramps are set
in motion and the process pauses until they reach their targets with
yield dur after which the loop continues and cyclically churns
through each data point until all have been “performed.” The last
few lines ramp the oscillator to 0 and then stop and finish.

function* sonify(data) {
  let dur = 0.005 // duration between data points in seconds
  let datum = data.next()
  function map(v) {
    let kn = 60 + v * 40
    let f = mtof(kn)
    let db = -30 + v * 10
    let a = dbtolin(db)
    return {pit: f, amp: a}
  }
  function Sin(timeWorker) {
    let s = new SinOsc(timeWorker)
    s.start()
    this.setPit = function(freq) { s.freq( freq ) }
    this.setAmp = function(gain) { s.gain( gain ) }
    this.rampPit = function(freq,dur) { s.freqTarget( freq,dur ) }
    this.rampAmp = function(gain,dur) { s.gainTarget( gain,dur ) }
    this.stop = function() { s.stop() }
    this.ramps = function(f,a,d) {
      this.rampPit(f,d)
      this.rampAmp(a,d)
    }
  }
  let sin = new Sin(this)
  if (withFFT) postMessage("makeFFT()")
  let params = map(datum.value)
  sin.setPit(params.pit)
  sin.setAmp(0)
  while (!datum.done) {
    sin.ramps(params.pit, params.amp, dur)
    yield dur
    if (withSliderDisplay) postMessage("move1D("+datum.value+")")
    if (withChart) postMessage("move2D()")
    datum = data.next()
    params = map(datum.value)
  }
  sin.rampAmp(0,0.1)
  yield 0.1
  sin.stop()
  postMessage("finish()")
}

Workshop discussions are mostly focused on customizing the
above code and demonstrating extensions described later in this report.
What follows in the next section is a discussion of the TimeWorkers
framework “under the hood.” This can be skipped if one’s
main interest is in customizing sonifications rather than digging into
the underlying system.
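
Before looking under the hood, it may help to see that a generator of this kind is simply a sequence of requested pauses. The following minimal sketch steps a stripped-down generator by hand and accumulates logical time; the names steps and logicalSchedule are hypothetical and no sound is produced.

```javascript
// A stripped-down stand-in for a sonify generator: one pause per data point,
// followed by a final 0.1 s pause for the fade-out.
function* steps(data) {
  const dur = 0.05
  for (const v of data) {
    // a real sonify loop would set ramp targets derived from v here
    yield dur
  }
  yield 0.1
}

// Step through a generator, recording the logical time at which each pause begins.
function logicalSchedule(gen) {
  let now = 0
  const times = []
  for (const wait of gen) {
    times.push(now)
    now += wait
  }
  return { times, total: now }
}

const schedule = logicalSchedule(steps([0.2, 0.9, 0.4]))
console.log(schedule.times, schedule.total)
```

The framework does the same stepping, except that it really waits for each yielded duration before resuming the generator.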

3. PROGRAMMING STRUCTURE AND SUPPORTING 
FUNCTIONS 

The framework has no dependencies. It is a lightweight project
which is Free Open-source Software (FOSS) and has the additional
feature of No Internet Required for Development (NIRD). Workshops
and individual work are equally possible online and offline, for
example, during field work with no connectivity. A project’s .html
landing page loads a single associated script file, engine.js, which
contains all supporting functions. Files and modules are shown schematically
in Figure 3.



[Figure 3 diagram; surviving labels: “data file”, “Worker thread runs sonify loop”.]

Figure 3: Structure and modules.

The project landing page sets up web-related configurations, specifies
the user interface (UI), loads the script file, engine.js, and is
where the sonification is “composed.” Various “hardwired” globals
need to be declared which will be communicated to the script file, including
a default value for dataFileName. Likewise, the script file
expects a “hardwired” generator function with the name sonify
(which should be defined using JavaScript’s function* syntax [10]).
This function instantiates any unit generators (ugens) it will be using,
for example with

new SinOsc(timeWorker)

as shown above, and specifies data-to-sound parameter mappings
which unfold through time.

Table 1: project files

web landing page | supporting script
index.html       | engine.js

Table 2: index.html elements

<head>                                              | <body>                      | <script>
<meta> specifies metadata                           | configures UI elements      | sets global and local variables
(optional) <script> loads any auxiliary script files | (options to hide or expose) | loads engine.js
e.g., graphing library                              | e.g., drag and drop         | must define function* sonify(data)

Table 3: engine.js tasks, classes (and optional functionality)

set locals     | polish UI                  | specify web worker(s)           | set up spork mechanism | define DSP ugens
audio context  | check browser capabilities | WorkerThread                    | TimeIterator           | e.g., SinOsc
data source    | get UI elements            | uses inline definitions         | play / stop            | e.g., FM
timing cushion | set UI element states      | (add graphing capability)       | nextEventAt            | uses setValueAtTime,
worker arrays  | (add drag and drop)        | (connect real-time UI elements) | uses async / await     | linearRampToValueAtTime

For brevity’s sake the script file, engine.js, is not reproduced here
but can be found in the js/ subdirectory of the project repository[11].
This script provides the TimeWorkers structure through its class definitions,
functions and own variable settings. Any special tokens
which are referenced by the sonify generator function, e.g. SinOsc,
will be resolved against what is defined or declared in the global
scope after engine.js has been loaded.

The script file contains several parts: setting local variables, polishing
the UI, and a system for “performing” sonifications composed
with the sonify generator function.

A WorkerThread interface sets up and runs this time-sensitive
apparatus in separate threads. The TimeIterator class provides a
mechanism which waits between events in the sonify generator’s
loop and compensates for timing jitter. It uses the performance.now()
clock to compare real time with expected logical time. Finally,
the ugen part of the script file defines any synthesis or DSP
patches which are used.
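
One way to picture jitter compensation is the following minimal sketch. The function compensatedDelay and its parameters are hypothetical and are not taken from engine.js; the idea is only that each wait is shortened by however far real time has run ahead of logical time, so that timer lateness does not accumulate.

```javascript
// startMs:   real clock reading (ms) when the loop started
// logicalMs: logical time (ms since start) at which the next event is due
// nowMs:     current real clock reading (ms)
// Returns the real wait, in ms, needed for the next event to land on its logical time.
function compensatedDelay(startMs, logicalMs, nowMs) {
  const elapsed = nowMs - startMs
  return Math.max(0, logicalMs - elapsed)
}

// Example: the next event is due 50 ms after the start, but the previous timer
// fired late and 47 ms have already passed, so only 3 ms of waiting remain.
console.log(compensatedDelay(1000, 50, 1047)) // 3

// With a live clock the same function would be called as:
const start = performance.now()
console.log(compensatedDelay(start, 50, performance.now()))
```

If the loop has fallen behind by more than one period, the delay clamps to zero and the next event is issued immediately.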

var context

is declared to hold the window.AudioContext which gets instantiated
at sound start and closed at sound stop.

var workerThreads = []

is the array containing the pool of WorkerThread instances, and

var uwta = []

is a multi-dimensional array (whose name is shorthand for “ugenWorkerThreadsArrays”)
that contains the set of all ugens in all WorkerThreads.

A programming pattern often used in sonification in the Chuck
language [4] has two aspects. The first is the spork function which
calls a given function in a parallel, separate thread with its own
logical timebase. (A child process spawned by a sporked function
can also spork its own child processes.) The second construct is a
means for looping over data; in Chuck this is usually a while loop
where event time advances each iteration. The loop executes in its
own thread. The present framework supports both features using its
WorkerThread and TimeIterator constructs.

When makeWorkerThread (Table 1) creates a new instance,
the spawned JavaScript Worker [12] is of a special inline type (as opposed
to the more common type which is usually created by loading
a dedicated script file).

var blob = new Blob([script])
var worker = new Worker(URL.createObjectURL(blob))

The script passed into the new Blob sets up a mechanism for dynamic
object definition. It calls addEventListener on the new worker
and sets how the worker will handle incoming messages. By telling
it to handle them with an eval (in the global scope), the worker’s
set of variables and functions is literally “grown” by posting message
strings to be evaluated which contain the desired definitions and settings.
One of these, for example, is the sonify function defined back
in the landing page. Dynamically defining timeWorkers in this way
allows the sonify function to also spork processes which will become
its own new child workers each of which runs in a separate thread.
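
The “growing” mechanism can be tried without a Worker at all. In the minimal sketch below, onMessage and globalEval are hypothetical names standing in for the worker’s message handler; the actual inline script in engine.js is not shown in this paper. The sketch relies only on the standard behaviour that an indirect call to eval runs in the global scope.

```javascript
// Calling eval through another name makes it an indirect eval,
// which evaluates its argument in the global scope.
const globalEval = eval

// Stand-in for the worker's message handler: every incoming string is evaluated,
// so definitions sent as messages accumulate in the global scope.
function onMessage(event) {
  globalEval(event.data)
}

// "Grow" the scope by posting definitions as strings, then use them.
onMessage({ data: "function double(x) { return 2 * x }" })
onMessage({ data: "var answer = double(21)" })

console.log(typeof globalThis.double, globalThis.answer) // function 42
```

Inside a real Worker the handler would be registered with addEventListener("message", ...) and the strings would arrive via postMessage from the main thread.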

The spork function itself instantiates a time-sensitive data iterator
with makeTimeIterator. A TimeIterator will pause a generator
for a given duration with its method nextEventAt() which
is an async function utilizing JavaScript’s async / await ([13]) pausing
functionality. When sporked, a sonify generator’s loop is started
with nextEventAt("start") that executes its first cycle. A
subsequent yield in the sonification loop will set the amount of
time to pause on the next call to nextEventAt (which calls itself
recursively) and the loop continues.

In the definition below, fstar is the sonify generator defined in
the landing page and args contains a data iterator with the provided
data series (which has had its range normalized).

function spork(fstar, ...args) {
  let ti = makeTimeIterator()
  ti.sporkScript = fstar.apply( ti, args )
  ti.nextEventAt("start")
}

To reiterate, calling spork with both a sonify generator and a
TimeIterator containing the data as shown

spork(sonify,data)

will create a pattern comparable to a Chuck-based sonification which
consists of essentially the same parts: spork a new thread which sets
up a sound source and mapping strategy, and then loops through a
conditioned data series, pausing after each data point.

In Chuck, pausing is written using the syntax

dur => now;

whereas the TimeWorkers equivalent uses

yield dur

A yield in the sonify generator loop invokes a JavaScript Promise
in the TimeIterator object whose setTimeout is set to the duration
to await.
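
The interplay of yield, Promise and setTimeout can be sketched as follows. The names sleep and perform are hypothetical and the sketch uses a plain loop rather than the recursive nextEventAt of the framework, so it illustrates the idea without reproducing the actual code.

```javascript
// Wrap setTimeout in a Promise so that a pause can be awaited.
function sleep(seconds) {
  return new Promise(resolve => setTimeout(resolve, seconds * 1000))
}

// Drive a generator: each yielded value is the number of seconds to wait
// before the generator is resumed.
async function perform(gen) {
  let count = 0
  for (const wait of gen) {
    count++
    await sleep(wait)
  }
  return count
}

// A tiny generator with three short pauses.
function* threeSteps() {
  yield 0.01
  yield 0.01
  yield 0.01
}

const finished = perform(threeSteps())
finished.then(n => console.log("pauses performed:", n))
```

Because the generator is only resumed after the awaited Promise resolves, code following a yield runs one pause later, which is the behaviour that dur => now provides in Chuck.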

3.1. SinOsc ugen example 

Custom ugens comprise patch definitions made with the Web Audio
API’s audio nodes. The makeSinOsc example shown here instantiates
an oscillator with gain control using the API’s createOscillator()
and createGain() methods [6].

function makeSinOsc()
{
  let o = context.createOscillator()
  let g = context.createGain()
  o.type = "sine"
  o.frequency.value = 440
  g.gain.value = 0.1
  o.connect(g)
  g.connect(context.destination)
  g.connect(dac)
  return { osc:o, gain:g }
}

The object gets instantiated in a wrapper called SinOsc which
when instantiated itself with new also includes methods to alter its
parameters, for example, by changing its frequency with the following
custom freq() method:

freq: function (hz) {
  let n = this.dsp
  postMessage("ugens["+n+"].osc.frequency.setValueAtTime("+hz+","+(myThread.now+cushion)+")")
}

this.dsp refers to the ugen itself which is held in the main
thread’s array ugens[]. The message posted to the main thread
looks up the osc field of the ugen and changes its frequency using
the Web Audio API’s setValueAtTime, scheduled at a time which
corresponds to the worker thread’s “now” plus a constant offset.
A full ugen definition comprises instantaneous setters for all parameters, as well
as custom time-varying envelopes, for example made with the Web
Audio API’s linearRampToValueAtTime. Note that the patch
code also includes a connection from the patch’s summing point to a
global summing point called dac.
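
To make the message format concrete, the sketch below builds the same kind of string with sample numbers and evaluates it against a stub. The helper freqMessage, the stub ugens array and the use of eval on the receiving side are assumptions for illustration only; in the browser the osc field would be a real OscillatorNode.

```javascript
// Build the command string that a worker would post to the main thread.
function freqMessage(n, hz, now, cushion) {
  return "ugens[" + n + "].osc.frequency.setValueAtTime(" + hz + "," + (now + cushion) + ")"
}

const msg = freqMessage(0, 440, 1.5, 0.25)
console.log(msg) // ugens[0].osc.frequency.setValueAtTime(440,1.75)

// Stub standing in for the main thread's ugens array: it records calls
// instead of scheduling audio parameter changes.
const calls = []
const ugens = [
  { osc: { frequency: { setValueAtTime: (value, time) => calls.push([value, time]) } } }
]

// Evaluating the string performs the scheduled parameter change on the stub.
eval(msg)
console.log(calls) // [ [ 440, 1.75 ] ]
```

The cushion places the change slightly in the future, leaving time for the message to cross from the worker to the main thread before the scheduled moment arrives.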

Different sound sources can be made available by expanding the
library of ugens defined in engine.js. Each would comprise a “make
the patch” portion and a wrapper (with the ugen name) which includes
the set of parameter altering methods.


3.2. FM patch 

For example, a simple two-oscillator FM patch could look like the 
following: 

function makeFM()
{
  let mod = context.createOscillator()
  let modGain = context.createGain()
  mod.type = "sine"
  mod.connect(modGain)
  let car = context.createOscillator()
  let g = context.createGain()
  car.type = "sine"
  modGain.connect(car.frequency)
  car.connect(g)
  g.connect(context.destination)
  g.connect(dac)
  let cFreq = 2200
  let index = 33
  let mRatio = .1
  modGain.gain.value = cFreq * index
  mod.frequency.value = cFreq * mRatio
  car.frequency.value = cFreq
  g.gain.value = 0.1
  return { osc:car, gain:g, mod:mod, modGain:modGain }
}

All ugens need to be accessible in the timeWorker thread in 
which the sonify loop is running. A last step, then, in ugen creation 
is to add the ugen wrapper, for example FM, to the list of functions 
which gets dynamically installed inline when a new WorkerThread 
is instantiated. 

4. EXTENSIONS 

Changing the sound source, sounding multiple time series and adding
graphing capabilities are extensions which complement the basic example
described above in Section 2.

4.1. Voicing 

Changing to a more interesting sound source is possible in the sonify
generator itself. This approach relies on combinations of ugens defined
in the engine.js script. Where the basic example uses a single
SinOsc ugen as its instrument, the example here demonstrates additive
synthesis built by summing multiple sines which are harmonically
tuned. The new instrument Harmonics is defined directly
within the sonify generator.

function Harmonics(nSins,timeWorker) {
  this.sins = new Array
  for (let i = 0; i < nSins; i++) this.sins.push( new SinOsc(timeWorker) )
  this.sins.forEach(function(x) { x.start() })
  function fi(f,i) { return f*(i+1) }
  function ai(a,i) { let h = (i+1); let odd = (h%2) ? a : a*0.1; return odd/h }
  this.setPitch = function(freq) { this.sins.forEach(function(x,i) { x.freq( fi(freq,i) ) }) }
  this.setGains = function(gain) { this.sins.forEach(function(x,i) { x.gain( ai(gain,i) ) }) }
  this.freqTarget = function(freq,dur) { this.sins.forEach(function(x,i) { x.freqTarget( fi(freq,i),dur ) }) }
  this.gainTarget = function(gain,dur) { this.sins.forEach(function(x,i) { x.gainTarget( ai(gain,i),dur ) }) }
  this.stop = function() { this.sins.forEach(function(x) { x.stop() }) }
  this.ramps = function(f,a,d) {
    this.freqTarget(f,d)
    this.gainTarget(a,d)
  }
}

One of these instruments is then instantiated in the sonification loop,
for example, with

let vox = new Harmonics(8,this)

to create a harmonic series of 8 SinOscs. Given a pitch frequency f,
function fi(f,i) sets their tunings. Amplitude relationships
in function ai(a,i) create a clarinet-like structure favoring
odd harmonics. A convenience function ramps is provided which
applies frequency and amplitude updates to the entire additive synthesis
patch.
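
The tuning and amplitude rules can be inspected on their own. This minimal sketch copies fi and ai from the listing and tabulates the eight partials for an arbitrarily chosen fundamental of 220 Hz at unit gain; the partials array is added here for illustration only.

```javascript
// Frequency of partial i (0-based): integer multiples of the fundamental.
function fi(f, i) { return f * (i + 1) }

// Amplitude of partial i: 1/h rolloff, with even harmonics attenuated by a further factor of 10.
function ai(a, i) {
  let h = (i + 1)
  let odd = (h % 2) ? a : a * 0.1
  return odd / h
}

// Tabulate the eight partials of a 220 Hz tone at unit gain.
const partials = Array.from({ length: 8 }, (_, i) => ({
  harmonic: i + 1,
  freq: fi(220, i),
  amp: ai(1, i)
}))
console.log(partials)
```

Harmonic 2 comes out at an amplitude of 0.05 while harmonic 3 is about 0.33, which shows how strongly the odd harmonics dominate.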

The following set of extensions are turned on or off with flags in 
the index.html file. By default, the withDemo flag is set. Only one 
option is allowed at a time, so remember to set 

withDemo = 0 

before exploring these others. 

4.2. Polyphony from multiple data series 

Multiple time series are interesting to sonify at the same time, for
example, to hear correlations by ear. Data can be input from two or
more separate data files as in this example which combines monthly
USA gross domestic product (GDP) from 1969 to 2016 and global
CO2 level for the same period. The curves shown in Figure 4 have
been normalized to the same range.



Figure 4: GDP and CO2.


The example landing page, index.html, has a provision for hearing
these two playing together, as two independent musical voices.
Change the state of withDemo and this flag for this to take effect:

withTwoFiles = 1 

Two data files will now be specified and will spawn two TimeWorker
threads both using the single sonify generator as defined. In
this example, one can hear details like the 2008 financial downturn
and the seasonal flux in global CO2. Overall, the two quantities follow
a coincident rising trend.

4.3. Animated Chart 

Similar to the interest in multi-modal data presentation described in 
[14], sonification in the present framework can be combined with 
graphing. Chart.js is a FOSS project for interactive plotting in the 
browser and is integrated into the project by loading a single script 
file (which can be locally sourced for creating a NIRD environment). 

Again, the example landing page, index.html, has a provision for 
demonstrating this extension by changing withDemo and this flag: 

withChart = 1 


[Figure 5 screenshot: browser page “Sound and Animation (15 seconds)” showing a chart titled “Sea Ice Concentrations from NSIDC Passive Microwave Data (1979-2015)”.]

Figure 5: Simultaneous sound and graph of Arctic Sea Ice Minimum
per year.

Playing the sonification in Figure 5 animates the black dot on
the curve. Synchronized sound and animation is accomplished with
postMessage("moveGraph()") inside the loop in the sonify
generator. Each successive call advances the black dot to the next
data point in an array of 2D data points that was input from a multi-column
data file (columns are year and value).
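
A minimal sketch of the data side of this extension follows. The helpers parsePoints and makeCursor are hypothetical, columns are assumed to be separated by whitespace or commas, and the sample numbers are invented for the example; a real moveGraph would also redraw the chart, which is omitted here.

```javascript
// Parse a two-column text file (year, value) into an array of 2D points.
function parsePoints(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line !== "")
    .map(line => {
      const [x, y] = line.split(/[\s,]+/).map(Number)
      return { x, y }
    })
}

// Return a function which, on each call, advances to the next point
// and holds at the final point once the series is exhausted.
function makeCursor(points) {
  let i = -1
  return function moveGraph() {
    i = Math.min(i + 1, points.length - 1)
    return points[i]
  }
}

// Invented sample data, not taken from the sea ice dataset.
const points = parsePoints("2000 1.5\n2001 1.0\n2002 0.5\n")
const moveGraph = makeCursor(points)
console.log(moveGraph(), moveGraph(), moveGraph())
```

Calling the cursor once per loop iteration keeps the highlighted point in step with the data value currently being sounded.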

4.4. Real-time FFT display 

Likewise, change withDemo and the following flag in the example 
landing page, index.html, and the sonification’s audio output will be 
displayed as a time-varying spectrum. 

withFFT = 1 

An FFT analyzer computes the spectrum of the global summing 
point dac in real time. 

5. CONCLUSIONS 

A 40+ year tradition has evolved a well-known pattern for sequencing
scores and real-time synthesis in languages like Pla [1], Common
Music [3], ChucK [4] and others. The sonify generator’s loop is
a descendant written in JavaScript. Running in the browser, it allows
flexible programming using the full power of the language and
can be rapidly experimented with on any browser-equipped system.
Sonifications created using the framework run equally well on mobile
and other smaller systems.

Table 4: Time Workers framework in terms of goals suggested by Sonification Report [7]

attribute     | goal                                                                                                 | now | soon | never
Portability   | consistent                                                                                           |     | X    |
Portability   | reliable                                                                                             | X   |      |
Portability   | portable across various computer platforms                                                           | X   |      |
Portability   | moving between real-time and non-real-time sound production                                          |     | X    |
Flexibility   | simple controls for altering the data-to-sound mappings                                              | X   |      |
Flexibility   | simple “default” methods of sonification that allow novices to sonify their data quickly and easily  | X   |      |
Integrability | easy connections to visualization                                                                    | X   |      |
Integrability | easy connections to visualization programs, spreadsheets, laboratory equipment                       |     |      | ?
Integrability | standardized software layer                                                                          |     |      | ?

Pla’s voices are analogous to sonify generator loops because they
constitute groups of time-ordered events which can themselves be
voices (recall that spork-ed child threads can spork their own
children). Other pertinent features of Pla also have bearing on the
present framework (these are distilled from a 1983 description):
“Higher levels of musical control are implemented as voices and
sections ...” “...notes that somehow belong together are grouped
under the rubric of a voice.” “Arbitrarily large groups of voices can
be organized into a section, which then becomes nearly equivalent to
a voice.” “Another kind of grouping is based on voices... voices can
create other voices to any level of nesting.”

Common Music’s similar features involve multiple types: “Thread -
A collection that represents sequential aggregation. A single
timeline of events is produced by processing substructure in
sequential, depth-first order.” “Merge - A collection that
represents parallel aggregation, or multiple timelines. A single
timeline of events is produced by processing substructure in a
scheduling queue.” “Algorithm - A collection that represents
programmatic description. Instead of maintaining explicit
substructure, a single timeline of events is produced by calling a
user-specified program to create new events.”

The Time Workers framework described here offers a way to construct
the above relationships in browser-based platforms and offers
solutions for some, but not all, of the goals cited in Sonification
Report [7]. Table 4 lists the boxes it checks off.

In the future, faster-than-sound soundfile writing will be directly
supported, though for now file output is only possible through browser
sound capture plug-ins (which run in real time). Faster-than-sound is
a highly desirable feature and is something that’s been supported in
both Common Music and ChucK. Regarding the former, “Realization in
Common Music can occur in one of two possible modes: run time and
real time. In run-time mode, realized events receive their proper
‘performance time stamp,’ but the performance clock runs as fast
as possible. In real-time mode, realized events are stamped at their
appropriate real-world clock time.” For the latter, ChucK’s “silent
mode” is the equivalent.
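The difference between the two modes can be illustrated with a small Python sketch. It is not taken from Common Music, ChucK, or the Time Workers framework; the function name realize and its parameters are hypothetical. Events are (onset, payload) pairs on a logical clock, and the only difference between the modes is whether the loop waits for wall-clock time to catch up.

```python
import time


def realize(events, real_time=False, sleep=time.sleep):
    """Stamp time-ordered events with their performance time.

    events: iterable of (onset_seconds, payload) pairs.
    real_time=False: "run-time" mode, the logical clock jumps from
        event to event as fast as possible (faster than sound).
    real_time=True: the loop waits until each event is due.
    sleep is injectable so the waiting can be observed or faked.
    """
    realized = []
    now = 0.0  # logical performance clock, in seconds
    for onset, payload in sorted(events, key=lambda e: e[0]):
        if real_time:
            sleep(onset - now)  # wait out the gap to the next event
        now = onset
        realized.append((onset, payload))
    return realized
```

Both modes yield identical time stamps; only the elapsed wall-clock time differs, which is what makes faster-than-sound file writing possible.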

The recently standardized AudioWorklet [15]¹ will be integrated
into the framework in the coming months. Of particular interest is
another recently proposed enhancement to Web Audio to support
multi-channel output.

¹ As of this writing, only the Chromium browser family supports
AudioWorklet. It is expected soon in Firefox, at which point the
integration work will commence.

Also for the future, direct real-time sonification from live sensor
data can be contemplated. This important feature opens up
applications such as bio-feedback [16] or other kinds of feedback such
as providing real-time “cracking” sounds to operators of fracking
pumps (where presently feedback is provided after the fact and one
can imagine the problems resulting from the over-stimulation of shale
gas wells). Sonification has become vital in medical applications,
even making inroads on traditional treatment practices in cases where
listening to data provides equal or better sensitivity and specificity
compared to visual means. The brain stethoscope, for example, allows
rapid detection of non-convulsive seizures by non-specialists [17].

Interest in sonification is burgeoning as sensors and data
collections become an increasingly ubiquitous part of daily life.
Employing well-known sound generation techniques from computer music,
sonification can play a role in the work of domain experts and
students in sciences and arts, as well as for general communication.
ABSTRACT

Sequoia is a new software library for musical sequencing, with
generative capabilities and sample-accurate timing. The architecture
supports a variety of techniques, including polymetric sequencing,
clock division, probability, and other parameters which can be
manipulated in real time - or even sequenced themselves. The core
library is written in C and supports JACK MIDI; Python bindings
are also available.

1. MOTIVATION 

In recent years, the electronic music community has shown a growing
interest in the use of standalone hardware units, both for studio
production and live performance [1]. Among their many appeals,
these devices have the advantage of being modular - drum machines,
synthesizers, samplers, sequencers, mixers, and effects units can be
connected and re-connected in myriad ways to accommodate a variety
of workflows. Each component serves a unique role and interfaces
with other components through well-defined interfaces: line-level
audio, and control signals typically in the form of MIDI or CV
(control voltage).

The Linux audio ecosystem is well-poised to emulate this paradigm
in software; audio routing libraries like JACK, and control signal
protocols like MIDI and Open Sound Control (OSC) provide a framework
for connecting standalone applications into software “rigs” suitable
for composition and performance alike. Indeed, such modularity
is central to the Unix philosophy: programs should “do one thing
and do it well” [2]. True to form, numerous drum machines (e.g.
hydrogen, drumkv1), synthesizers (zynaddsubfx, amsynth, dexed),
samplers (shuriken, qsampler, petri-foo, sooperlooper), mixers
(jack-mixer, non-mixer), and effects (calf-plugins, guitarix) are
available from popular Linux repositories. Additional utilities exist
for managing audio/MIDI connections (qjackctl, catia/claudia/carla)
and saving/restoring sessions (lash/ladish/nsm/aj-snapshot).

Sequencers, however, are comparatively absent from this ecosystem.
Perhaps the best-established example is seq24 [3], which, albeit
stable and relatively comprehensive, has not been significantly
updated since 2010, and suffers from usability issues which hinder
on-the-fly composition. Various sequencers exist within larger DAW
applications like Ardour [4], LMMS [5], Qtractor [6], Rosegarden [7],
and Muse [8], but these don’t fit into the modular paradigm described
here. Furthermore, the predominant interface for these software
sequencers is the piano roll, which is well suited for editing live data
captured from a MIDI controller, but less appropriate for the quick
manipulation of drum patterns and arpeggios typical of dance music.
For this task, a traditional step sequencer is desired.

But step sequencers can be quite complex. They typically feature
live sequence composition, real-time manipulation, and chaining
of sequences. More advanced examples include generative properties
like probability, ratcheting, and meta-sequencing, in addition
to step-wise parameters like microtiming and control variable
modulation. With such a wide variety of features, it can be
challenging to design applications which cover all the bases - but
this is primarily a problem of interface design. The essentials of
modern sequencing - timing, synchronization, live manipulation, etc. -
can be separated from the problem of application design, and distilled
into a general-purpose library, as in the “model-view-controller”
paradigm [9]. This is the motivation for Sequoia.



Figure 1: A Sequoia session is connected to two different client
applications using JACK. Here ZynAddSubFX (zyn-fusion) and drumkv1
are being used to create a simple beat. Carla is used to manage
audio and MIDI connections.


2. DESIGN 

The architecture of Sequoia is based on four object classes: session, 
sequence, trigger, and port. 

A sequence is a discrete series of events which steps in time
with a metronome. In this sense, Sequoia is a “step sequencer”, but
events are not required to be evenly spaced in time (see Section 4.1).
The length of a sequence is the number of steps that the sequence
contains. There is no limit (aside from memory) to the length of a
sequence, but once specified (via instantiation), it is fixed. This is
less of a constraint than it may seem, however, as sequences can be
chained together and “meta-sequenced” dynamically (see Section 5.3).
Sequences have several dynamic parameters: the mute state, transpose,
clock division, playhead position, playhead direction, and loop
boundaries can all be modified live during playback.

Triggers (or “trigs” for short) are the event objects which may
populate the steps of a sequence. They store information depending
on their type; the current trigger types are:

• Null: (an empty trig)

• Note: note value, velocity, length

• CC: number, value

Each trigger also carries a channel number, a probability and a
microtime. Microtime is a floating-point value in the range [-0.5, 0.5),
where the units are in steps. Thus a trigger can be placed up to half
a step before or after its nominal timing, allowing for irregular
rhythms, “humanization”, and swing.

Sequences run within a Sequoia session, which controls the tempo
and transport (start/stop/pause) state applied to all contained sequences.
A session can have a number of ports for communicating with other
applications - including other Sequoia sessions. The ports can be
input (“inports”) or output (“outports”), have descriptive names, and
can be assigned to sequences individually, or on a many-to-one
basis. For example, we may have 4 sequences (kick, snare, closed hat,
open hat) feeding into a single outport called “drums”, while another
melodic sequence feeds into an outport called “synth” - all
sequencing in time within the same session.

3. API 

Sequoia is implemented as a C library in the “object-oriented” style:
data structures are presented as custom types with associated
methods for instantiation and mutation. All library functions and data
types are prefixed with sq_*. The full API is documented on the
associated GitHub wiki; here we present a simple example which
constructs and plays a 2-note sequence:

#include "sequoia.h"

#define STEP_RES 256

int main(void) {

    /* create the session and a 16-step sequence */
    sq_session_t sesh;
    sq_session_init(&sesh, "My Session", STEP_RES);

    sq_sequence_t seq;
    sq_sequence_init(&seq, 16, STEP_RES);

    /* create an outport and route the sequence through it */
    jack_port_t *port;
    port = sq_session_create_outport(&sesh, "My Port");
    sq_sequence_set_outport(&seq, port);

    /* placeholder trigger, reused to populate steps 0 and 8 */
    sq_trigger_t trig;
    sq_trigger_init(&trig);

    sq_trigger_set_note(&trig, 60, 100, 4);
    sq_sequence_set_trig(&seq, 0, &trig);
    sq_trigger_set_note(&trig, 67, 100, 4);
    sq_sequence_set_trig(&seq, 8, &trig);

    /* add the sequence, set the tempo, and start sequencing */
    sq_session_add_sequence(&sesh, &seq);
    sq_session_set_bpm(&sesh, 120);
    sq_session_start(&sesh);

    return 0;

}

Here, STEP_RES is the step resolution, in ticks per step. This needs
to be the same for all sequences in the session - attempting to add
a sequence with incompatible step resolution will result in an error.
We create an outport for the session called “My Port” and set the
sequence to output events through it. We then create a placeholder
trigger object trig and use it to populate the sequence. Finally, we
add the sequence to the session, set the BPM, and start sequencing.

3.1. Python Bindings

The main C library is augmented with Python bindings which obey
a direct mapping between classes and methods. In Python, the
example above could be written as:

import sequoia as sq

STEP_RES = 256

sesh = sq.session("My Session", STEP_RES)
seq = sq.sequence(16, STEP_RES)
port = sesh.create_outport("My Port")
seq.set_outport(port)

trig = sq.trig()

trig.set_note(60, 100, 4)
seq.set_trig(0, trig)
trig.set_note(67, 100, 4)
seq.set_trig(8, trig)

sesh.add_sequence(seq)
sesh.set_bpm(120)
sesh.start()


4. IMPLEMENTATION

A Sequoia session registers as a JACK external client whose name
is the session name (specified during instantiation). Input and output
ports are created as JACK MIDI ports (also named) which are served
by the JACK processing callback. The API is compiled into a shared
library plus header files, and can be installed e.g. in /usr/local/ for
dynamic linking across multiple applications.

4.1. Timing


Timing is managed by the JACK processing thread as it executes
within the context of the Sequoia session. The session keeps track
of the frame count as it works to fill the JACK buffer with
time-stamped MIDI events. Events are managed by the sequences which
handle time as a grid of microticks - intervals of time much shorter
than the step length which enable the microtiming functionality of
the sequencer. In the code example in Section 3, the microtiming
resolution is set to 256 ticks per step. In theory, this resolution can
be set much higher, though in practice, it will be limited by CPU
performance. The number of frames per tick (fpt) is:

$$\mathrm{fpt} = \frac{15 \cdot sr}{tps \cdot bpm} \qquad (1)$$


where sr is the sample rate, tps is the step resolution (ticks per step),
and bpm is the tempo in beats per minute. At 48 kHz with 256
ticks-per-step, there are 23 frames-per-tick at 120 BPM. At 4096
ticks-per-step, this becomes 1 frame-per-tick, which is the theoretical
maximum resolution for this tempo and sample rate.

Figure 2: Diagram visualizing the 3-tiered timing scheme used by Sequoia. At the highest level there are steps: 4 steps per beat (in the sense
of “beats per minute”), and one trig per step. Going down one level, each step is composed of several “microticks” which comprise the grid
for microtiming events. Here, only 8 microticks per step are shown for clarity, but a typical sequence may have 256 (or more) ticks per step.
Finally, there is the frame counter, which sweeps between the microticks until it reaches a tick boundary, at which point a trigger may be fired.
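The two figures quoted for Equation (1) can be checked with a few lines of Python. This is only a numerical sketch: the function name frames_per_tick is hypothetical, and the factor 15 is derived here as 60 seconds per minute divided by the 4 steps per beat shown in Figure 2.

```python
def frames_per_tick(sample_rate, ticks_per_step, bpm, steps_per_beat=4):
    """Audio frames per microtick, following Equation (1)."""
    frames_per_beat = 60.0 * sample_rate / bpm
    return frames_per_beat / (steps_per_beat * ticks_per_step)


# Whole frames available per tick at 48 kHz and 120 BPM
print(int(frames_per_tick(48000, 256, 120)))   # 23
print(int(frames_per_tick(48000, 4096, 120)))  # 1
```

The unrounded values are 23.4375 and about 1.46 frames per tick, so the quoted figures are the whole-frame parts.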


4.2. Trig-to-Microtick Translation 

Although the fundamental timing grid is managed at microtick
resolution, this implementation detail is hidden from the user by the trig
interface. The user manages the sequence data by setting its trigs
(one for each step); these trigs are then placed on the microgrid
according to their microtiming. The formula is:

$$\text{tick index} = (\text{step} + \mu\text{time}) \cdot tps \qquad (2)$$

At this tick index, we place a pointer to the trig, which allows us to 
look up both the trig parameters (e.g. probability, length) and the 
sequence parameters (e.g. mute, transpose) at trig time, to ensure 
that we send the correct MIDI event at the correct time. 
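As a rough illustration of Equation (2), the Python sketch below maps a step and a microtime to a position on the microgrid. The function name tick_index is hypothetical. Two details are assumptions not stated in the text: the result is rounded to the nearest tick, and a negative index (a trig on step 0 with negative microtime) wraps around to the end of the sequence.

```python
def tick_index(step, microtime, tps, n_steps):
    """Microgrid position of a trig, following Equation (2).

    step: step number within the sequence (0-based)
    microtime: offset in steps, in the range [-0.5, 0.5)
    tps: step resolution in ticks per step
    n_steps: sequence length in steps
    """
    if not -0.5 <= microtime < 0.5:
        raise ValueError("microtime must lie in [-0.5, 0.5)")
    raw = round((step + microtime) * tps)
    # Assumption: early trigs on step 0 wrap to the end of the grid.
    return raw % (n_steps * tps)


print(tick_index(8, 0.25, 256, 16))  # 2112
print(tick_index(0, -0.5, 256, 16))  # 3968
```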


4.3. Note-Off 

While note-on and control change events are recorded in the
microgrid at composition time (i.e. when the user calls
sq_sequence_set_trig()), note-off events are managed
differently. To see why, consider what would happen if a C note of
length 4 steps was recorded in the microgrid as a C-note-on plus a
C-note-off 4 steps later. Now imagine if the sequence transpose
parameter were changed in the middle of that note. The note-off would
be delivered for the wrong note value, and the synthesizer
downstream would be left with a hanging note. The same applies for
playhead manipulation, or any number of the other sequence parameters
which support live control.

The solution is to implement for each sequence a separate ring
buffer, specifically for note-offs, which is always running forward.
The length of this buffer is the maximum note length, which is also
the length of the sequence. The buffer gets populated with a
note-off (at the appropriate delay) whenever a note-on fires. When the
note-off is reached by the advancing buffer pointer, it is fired, and
then removed from the buffer. When a sequence (or the session) is
stopped, we can optionally call a “clean” command, which sweeps
through the off-buffer as quickly as possible, delivering all remaining
note-offs.
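The note-off buffer described above can be sketched in Python as follows. This is an illustration of the idea, not Sequoia's C implementation: the class and method names are hypothetical, and the buffer is assumed to advance once per step rather than once per microtick.

```python
class NoteOffBuffer:
    """Forward-running ring buffer of pending note-offs."""

    def __init__(self, length):
        # One slot per step; each slot lists pitches to switch off.
        self.slots = [[] for _ in range(length)]
        self.pos = 0

    def note_on(self, pitch, length_steps):
        """Schedule the note-off for a note-on that has just fired.

        pitch is the value actually sent (after any transpose), so the
        later note-off always matches the sounding note.
        """
        if not 1 <= length_steps <= len(self.slots):
            raise ValueError("note length exceeds buffer length")
        idx = (self.pos + length_steps) % len(self.slots)
        self.slots[idx].append(pitch)

    def advance(self):
        """Move one step forward; return and remove the due note-offs."""
        self.pos = (self.pos + 1) % len(self.slots)
        due, self.slots[self.pos] = self.slots[self.pos], []
        return due

    def clean(self):
        """Sweep the whole buffer, returning every remaining note-off."""
        pending = []
        for _ in range(len(self.slots)):
            pending.extend(self.advance())
        return pending
```

Because the pitch is captured when the note-on fires, changing transpose or moving the playhead afterwards cannot leave a hanging note.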


4.4. Lock-Free Parameter Control 

In a running Sequoia session, the JACK thread needs immediate
access to data that other threads (e.g. the UI thread) can manipulate
during playback. In a non-realtime application, this would be
accomplished with mutex locks [10], but in realtime audio, this is
unacceptable - the audio callback must never execute code that could
block for an indeterminate amount of time [11]. In lieu of mutex
locks, we synchronize data between threads via lock-free message
queues. For this, we use jack_ringbuffer_t as offered by the
JACK API. We then implement a simple messaging protocol that
allows for the UI thread to “set” or “get” critical data when the audio
thread enters the processing callback. This allows both threads to
access the data while avoiding any race conditions.

Message queuing offers a clean solution when the audio thread
is running, but it can present problems when the system is in a
dormant state. In this situation, for example, a queueing “getter” method
would block indefinitely, waiting for the processing callback to serve
the request. As another example, a user will commonly populate a
sequence with trigs before adding it to a running session. If the
sequence length is longer than the message queue, this would overflow
the buffer and cause an error.

Ideally, the getters and setters would access data directly when 
operating on a dormant structure, and use message queues when the 
sequencer is running. In Sequoia, this branching behavior is handled 
automatically - the data access methods are polymorphic according 
to the running state of the system. 
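The branching behavior can be sketched in Python as below. This shows only the shape of the protocol: the class and method names are hypothetical, and collections.deque stands in for jack_ringbuffer_t (it is not a real-time-safe lock-free buffer, merely a queue with the same producer/consumer roles).

```python
from collections import deque


class Sequence:
    """Parameter access that depends on the running state."""

    def __init__(self):
        self.transpose = 0
        self.running = False
        self._inbox = deque()  # stand-in for the lock-free ring buffer

    def set_transpose(self, value):
        """Called from the UI thread."""
        if self.running:
            # Running: hand the change to the audio thread as a message.
            self._inbox.append(("transpose", value))
        else:
            # Dormant: no audio thread reads the data, so write directly.
            self.transpose = value

    def process(self):
        """Called at the start of each audio callback."""
        while self._inbox:
            name, value = self._inbox.popleft()
            setattr(self, name, value)
```

With this arrangement a dormant sequence can be populated without touching the queue at all, which avoids the overflow case described above.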

5. GENERATIVE TECHNIQUES 

In addition to serving as a streamlined API for general-purpose,
time-critical sequencing with real-time control, Sequoia has been designed
from the ground up with generative music techniques in mind. Here,
we describe just a few of the possibilities which Sequoia enables.

5.1. Polymeter 

Since there’s no concept of a global step counter in Sequoia (only the
per-tick frame counter managed by the session), sequences are free
to run in and out of phase with each other, according to the least
common multiple of their lengths. For example, a 16-step sequence
played against a 15-step sequence will evolve through 240 steps of
variation before syncing back up and repeating itself.
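The length of the combined cycle can be computed directly. The helper below is a hypothetical illustration, not part of the Sequoia API.

```python
from functools import reduce
from math import gcd


def cycle_length(*lengths):
    """Steps before sequences of the given lengths realign (their LCM)."""
    return reduce(lambda a, b: a * b // gcd(a, b), lengths)


print(cycle_length(16, 15))  # 240
print(cycle_length(16, 12))  # 48
```

Lengths that share no common factor, such as 16 and 15, give the longest cycles; 16 against 12 realigns after only 48 steps.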
5.2. Probability

Trig parameters include probability, a floating-point value in the 
range [0,1] which determines what fraction of the time a trig actually 
fires. This applies to both note-type and CC-type triggers. 
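One common way to realize such a probability parameter is to compare it with a uniform random draw each time the trig is reached. The sketch below assumes this approach; the text does not say how Sequoia draws its random numbers, and the function name should_fire is hypothetical.

```python
import random


def should_fire(probability, rng=random):
    """Decide whether a trig fires on this pass.

    rng.random() returns a value in [0, 1), so a probability of 0
    never fires and a probability of 1 always fires.
    """
    return rng.random() < probability


rng = random.Random(2019)  # seeded only to make the demo repeatable
fired = sum(should_fire(0.25, rng) for _ in range(10000))
print(fired / 10000)  # close to 0.25
```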
Figure 3: Meta-sequencing. A Carla patch showing a slow modulation
sequence controlling the transpose parameter of a melody
sequence, which is driving the synthv1 synthesizer.


5.3. Meta-sequencing

Meta-sequencing, simply put, is “sequences sequencing sequences”.
Any of the sequence parameters - playhead, loop start, loop stop,
playback mode, transpose, mute state, clock divide - can be
controlled live from Sequoia’s MIDI-in ports. The way MIDI events
map to parameter controls is determined by a mapping defined by
the user upon sequence creation.

Combined with the concepts described above, this technique can
be very powerful - a single, monophonic sequence can be
manipulated by another (perhaps employing polymeter, probability, or clock
division) to generate a much longer, stochastically evolving sequence
(see Figure 3). Sequences can even be looped back into themselves
to give surprising results (Figure 4) - although care must be taken in
this case to avoid runaway conditions.



5.4. Algorithmic Control 

The facility of inports and controller mappings allows
external clients (e.g. Python scripts, Pure Data patches, Geiger
counters with USB connections...) to control sequence parameters in any
way one might wish, thus allowing a huge variety of algorithmic
methods to modulate the sequencer.

6. STATUS 

Sequoia is currently in active development. The core library
(libsequoia) is in a viable state, and the source code is available on GitHub
under the GPL license (v3) [12]. We are also in the process of
embedding the library within Ziggurat, an existing GUI sequencer
application [13]. Future work will focus on developing bindings to
other languages, and improving documentation.
Figure 4: Auto-sequencing. A melody sequence is fed back into itself
(notice the looped-back red line from synth to input on the melody
client), and the result is used to drive synthv1. Depending on the
melody and the input mapping, this situation can “run away” to
infinite pitch. If it doesn’t, the results can be a surprising transformation
of the original melody.
ABSTRACT

This paper presents the evaluation of a media clocking scheme in 
an AVB network segment. The JACK audio connection kit on each 
AVB processing server is synchronized to an IEEE 1722 media clock 
stream, as well as each UDP Soundjack receiver on each AVB proxy 
server. Thus, the transmission of each packet of an audio stream 
is bound to the transmission interval of the media clock stream and 
each participant is able to recover the same media clock. In this 
paper we present the evaluation of this media clocking scheme and 
the JACK client synchronization with the AVB network segment at 
hand. 

1. INTRODUCTION 

Soundjack [1] is a real-time communication software using
peer-to-peer connections to connect up to five participants to each other.
This software was designed as a tool for musicians and was first
published in 2006 [2]. The interaction with live music over the public
Internet is very sensitive to latencies, both round trip as well as
one-way. Thus, this application is mainly concerned with the
minimization of latencies as well as jitter.

1.1. fast-music and Soundjack 

In cooperation with the two companies GENUIN [3] and Symonics [4],
a rehearsal environment for conducted orchestras via the public
Internet is under development as the goal of the research project
fast-music. Up to 60 musicians and one conductor, who are randomly
distributed throughout Germany, shall be able to play together
live. The central node represents the multimedia signal processing
server network under investigation, which ideally will be located in
Frankfurt am Main: since it is the largest Internet exchange node
in Germany, it promises the smallest round trip latencies.

1.2. Concept for a Real-time Processing Server Network 

The basic signal processing functionality of the server network
connects up to 60 UDP streams to each other and mixes them. A single
server could easily handle mixing this amount of concurrent UDP
streams with reasonably low latency, but for future research in the
application of immersive audio technologies in real-time, a single
server is not sufficient to handle the computational load of 60
individual audio and video streams. Thus, a scalable infrastructure is
chosen to provide such signal processing capacities. The signal
processing provided by the Soundjack server network involves mixing
algorithms for audio and video streams. As an infrastructure for the
audio signal processing stage, the JACK [5] audio server is deployed.
JACK is a professional and open source audio server that allows
applications to share sample accurate audio data with each other. A
large number of signal processing applications and algorithms are
available for JACK. Details on the mixing application can be found
in [6].

Another benefit of such a scalable approach is the minimization
of service times of network packets, which is the time a packet
requires to travel on the wire until it is fully held in the input buffer
of the server’s network interface. During the service time of a
single network packet, no concurrent packets can be processed, which
may introduce some hold time in the upstream buffer of each
concurrent stream, adding to the overall round trip time. The reduction
is not significant. The test environment considered in this paper is
the Ethernet based campus network of the university.

A detailed description of the first design of the software
architecture and operating system configuration can be found in [7]. Recent
findings, however, have revealed the first design to be flawed and not
fully capable of providing the required features. A new software
architecture is under development. The results presented in this paper,
however, are not influenced by the rework of the software
architecture since the JACK server is running independently.

1.2.1. Audio Video Bridging - an Open Standard Solution 

Audio Video Bridging / Time-Sensitive Networking (AVB/TSN)
describes a set of IEEE 802.1 standards that operate on layer two of the
OSI model [8]. These standards enable computer networks to handle
audio and video streams in real-time. Operating only on OSI layer
two, AVB is not routable. It is defined for local network segments
only.

• IEEE 802.1AS [9]

Timing and Synchronization for Time-Sensitive Applications 
in Bridged Local Area Networks (referred to as gPTP) 

• IEEE 802.1Qat [10] 

Virtual Bridged Local Area Networks - Amendment 14: Stream 
Reservation Protocol (SRP) 

• IEEE 802.1Qav [11] 

Virtual Bridged Local Area Networks - Amendment 12: For¬ 
warding and Queueing Enhancements for Time-Sensitive Streams 
(FQTSS) 

• IEEE 1722 [12] 

IEEE Standard for Layer 2 Transport Protocol for Time-Sensitive
Applications in Bridged Local Area Networks (referred to as
AVTP)

• IEEE 1722.1 [13] 

IEEE Standard for Device Discovery, Connection Management, and
Control Protocol for IEEE 1722 Based Devices (referred to as
AVDECC)
The AVB standards are extensions for generic Ethernet networks
providing precise synchronization, resource reservation and
bandwidth shaping. Lower latencies and jitter, the avoidance of packet
bursts and bandwidth shortage are addressed, providing real-time
responsiveness to a computer network. These properties are used to
ensure a constant streaming with low latency and jitter inside the
Soundjack server network. Thus, the Soundjack client streams can
be processed inside the server network, without interfering with each
other.

AVB networks require special hardware for timestamping
Ethernet frames with separate transmission queues for each traffic class,
i.e. AVB traffic with Stream Reservation (SR) classes A/B and generic
Ethernet traffic. The IEEE 802.1-2014 [14] standard defines the two
stream reservation (SR) classes A and B. Both classes are used in an
SRP domain to differentiate audio and video traffic from other
Ethernet traffic. For SR class A, SRP reserves resources on all switch
ports along the path from talker to listener to maintain a transmission
interval of 125 μs (250 μs for SR class B). The implications of the
transmission interval are discussed in section 2.
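One direct consequence of a fixed transmission interval can be worked out with simple arithmetic: the interval determines how many audio samples fall into each packet. The Python sketch below is only such a calculation; the function name is hypothetical and the 48 kHz and 44.1 kHz sample rates are chosen as examples rather than taken from the text.

```python
from fractions import Fraction


def samples_per_packet(sample_rate_hz, interval_us):
    """Average audio samples per channel in one transmission interval."""
    return Fraction(sample_rate_hz * interval_us, 1_000_000)


print(samples_per_packet(48000, 125))  # 6      (SR class A)
print(samples_per_packet(48000, 250))  # 12     (SR class B)
print(samples_per_packet(44100, 125))  # 441/80 (not a whole number)
```

At 48 kHz a class A stream therefore carries exactly 6 samples per channel in each of its 8000 packets per second, whereas 44.1 kHz does not divide evenly into the interval.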

1.2.2. Network Synchronization with gPTP 

The precise synchronization of different devices spread throughout
a local area network requires a specialized protocol, i.e. PTP, which
involves several steps. Each time a gPTP capable device appears on
the network segment, a negotiation for the grandmaster role is
triggered. The best master clock algorithm (BMCA) compares the clock
information in announce messages that are broadcast by each PTP
capable device on the same clock domain. A clock domain is a part
of a network segment that is synchronized to the chosen grandmaster
clock; it is bounded by devices or Ethernet bridge ports that are not
gPTP capable (gPTP is a special profile [9] for PTP [15]). Each gPTP
capable Ethernet bridge port has a mode of its own, either master or
slave. The Ethernet port of the AVB device running the grandmaster
clock is in master mode and is the root of the hierarchical clock
distribution. The bridge port of the AVB switch it is connected to is
in slave mode. It receives clocking information rather than sending
it. Since the switch receives its gPTP clock from this bridge port
in slave mode, all its other bridge ports are in master mode. They
distribute the clock information of the grandmaster clock to the next
hop or AVB device.
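The core of the comparison performed by the BMCA can be sketched in Python. This is a simplified illustration: it orders announce data by the attribute sequence used for PTP clock comparison (priority1, clock class, clock accuracy, variance, priority2, clock identity, lower values winning) and omits the further tie-breaking on path length and port identity that the full algorithm performs. The class name, field names, and example values are hypothetical.

```python
from typing import NamedTuple


class Announce(NamedTuple):
    """Clock attributes in comparison order; lower wins."""
    priority1: int
    clock_class: int
    clock_accuracy: int
    variance: int
    priority2: int
    clock_identity: bytes


def best_master(announces):
    """Pick the grandmaster candidate from received announce data."""
    # Tuples compare field by field, which gives the required ordering.
    return min(announces)


a = Announce(246, 248, 0xFE, 0x4100, 248, bytes.fromhex("0011223344556602"))
b = Announce(246, 248, 0xFE, 0x4100, 248, bytes.fromhex("0011223344556601"))
c = Announce(100, 248, 0xFE, 0x4100, 248, bytes.fromhex("00112233445566ff"))
print(best_master([a, b]) is b)     # True: identical except for identity
print(best_master([a, b, c]) is c)  # True: lowest priority1 wins
```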

After this election process, the clock domain needs to be
synchronized. This is achieved in two steps: Syntonization, and Offset
and Delay Measurement. In the first step “SYNC” messages are sent
from the master to the slave port followed by a “Follow_Up”
message, which includes a timestamp taken close to the media (physical
layer) of the sender. Both messages are used to adjust the frequency
of the slave to the master clock. The second step involves
“Pdelay_Req” and “Pdelay_Resp” messages and measures the absolute
time offset between the master clock and the slave’s local clock. The
slave port adjusts its local clock to match the master clock. After this
procedure each network device is synchronized to the grandmaster
clock, matching its phase and frequency. For the exact mechanisms
and calculations see [9] and [16].
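The arithmetic behind the two steps can be sketched as follows. This is a simplified textbook form, assuming a symmetric link and ignoring the rate-ratio correction of the responder's turnaround time; the exact calculations are those in [9] and [16]. All function and variable names are hypothetical, and timestamps are plain nanosecond counts.

```python
def rate_ratio(master_tx, slave_rx):
    """Frequency ratio master/slave from two successive Sync messages.

    master_tx: (m1, m2) transmit times from the Follow_Up messages
    slave_rx:  (s1, s2) local receive times of the same messages
    """
    (m1, m2), (s1, s2) = master_tx, slave_rx
    return (m2 - m1) / (s2 - s1)


def link_delay(t1, t2, t3, t4):
    """One-way delay from a Pdelay_Req / Pdelay_Resp exchange.

    t1: request sent (initiator clock), t2: request received (responder),
    t3: response sent (responder),      t4: response received (initiator)
    """
    return ((t4 - t1) - (t3 - t2)) / 2


def clock_offset(master_tx, slave_rx, delay):
    """Offset of the slave clock relative to the master."""
    return slave_rx - master_tx - delay


delay = link_delay(1000, 1500, 1600, 1300)
print(delay)                            # 100.0
print(clock_offset(5000, 5600, delay))  # 500.0
```

In the example the round trip takes 300 ns of which 100 ns is turnaround time in the responder, leaving 100 ns per direction; a Sync message sent at 5000 and received at 5600 then implies the slave clock is 500 ns ahead.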

1.2.3. Control Messages and SO_TIMESTAMPING 

The CMSG macros are used by the operating system to create and
access control messages, which are also called ancillary data, that
are not provided by the generic payload of a raw Ethernet socket.
This additional control information includes among other things the
receiving interface, optional header fields, extended error description
or a set of file descriptors. Ancillary data is sent with sendmsg(),
received with recvmsg() and is stored as a list of struct cmsghdr
structures with data appended to it. The use case at hand is to
receive the hardware timestamp of the arrival of each AVTP packet.
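To make the shape of this ancillary data concrete, the Python sketch below decodes the payload of a timestamping control message. It assumes a 64-bit Linux system, where the payload is a struct scm_timestamping holding three struct timespec values of two 8-byte integers each; the function name is hypothetical. Index 0 carries the software timestamp and index 2 the raw hardware timestamp.

```python
import struct

NSEC_PER_SEC = 1_000_000_000


def parse_scm_timestamping(data):
    """Decode struct scm_timestamping into three nanosecond values.

    Assumes three (tv_sec, tv_nsec) pairs of 8-byte signed integers.
    Returns [software, legacy, raw_hardware]; unused entries are 0.
    """
    fields = struct.unpack("=6q", data[:48])
    pairs = zip(fields[0::2], fields[1::2])
    return [sec * NSEC_PER_SEC + nsec for sec, nsec in pairs]


# Fabricated payload for demonstration: only the hardware slot is set.
payload = struct.pack("=6q", 0, 0, 0, 0, 2, 7)
print(parse_scm_timestamping(payload))  # [0, 0, 2000000007]
```

In Python, socket.recvmsg() returns ancillary data as a list of (level, type, data) tuples; the data element of the timestamping message is what such a function would be given.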

The userspace interfaces to receive timestamped network packets are
the following [17]:

• SO_TIMESTAMP:

Generate timestamp with microseconds resolution for each
incoming packet using the system time.

• SO_TIMESTAMPNS:

Generate timestamp with nanoseconds resolution for each
incoming packet using the system time.

• SO_TIMESTAMPING:

Generate timestamp with nanoseconds resolution for each
incoming packet using the network hardware.

The SO_TIMESTAMPING interface has to be configured on the raw
Ethernet socket with setsockopt() and the appropriate flags have
to be chosen from the following:

1. Determine how timestamps are generated with
SOF_TIMESTAMPING_TX/RX:

• SOF_TIMESTAMPING_TX_HARDWARE:

Hardware transmission timestamp.

• SOF_TIMESTAMPING_TX_SOFTWARE:

Fallback in case of failure of
SOF_TIMESTAMPING_TX_HARDWARE.

• SOF_TIMESTAMPING_RX_HARDWARE:

Original, unmodified reception timestamp, generated by
the hardware.

• SOF_TIMESTAMPING_RX_SOFTWARE:

Fallback in case of failure of
SOF_TIMESTAMPING_RX_HARDWARE.

2. Determine how timestamps are reported in the control
messages with SOF_TIMESTAMPING_RAW/SYS:

• SOF_TIMESTAMPING_RAW_HARDWARE:

Return raw hardware timestamp.

• SOF_TIMESTAMPING_SYS_HARDWARE:

Return hardware timestamp converted to the system time.
The correlation between the transformed hardware
timestamps and the system time is as good as possible, but
not perfect. Requires support by the network device
and will be empty without that support.

• SOF_TIMESTAMPING_SOFTWARE:

Return software timestamp.

In addition to the setsockopt() call, it is necessary to initialize
the device driver for hardware timestamping with an ioctl() call
to SIOCSHWTSTAMP. The ioctl() has to be called with the
argument:

struct hwtstamp_config {
    int flags;
    int tx_type;
    int rx_filter;
};


Possible values for hwtstamp_config->tx_type are: 

• HWTSTAMP_TX_OFF: 

Deactivate hardware timestamping for outgoing packets. 

• HWTSTAMP_TX_ON: 

Activate hardware timestamping for outgoing packets. The sender
decides which packets are to be timestamped.

Possible values for hwtstamp_config->rx_filter are:

• HWTSTAMP_FILTER_NONE:

Deactivate timestamping for incoming packets.

• HWTSTAMP_FILTER_ALL:

Activate timestamping for any incoming packet.

• HWTSTAMP_FILTER_SOME:

Activate timestamping for all requested packets plus some more.

• HWTSTAMP_FILTER_PTP_V1_L4_EVENT:

PTP v1, UDP, any kind of event packet.
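The two configuration steps described above, a flag bitmask for setsockopt() and a struct hwtstamp_config for the ioctl(), can be illustrated with a minimal sketch. It only builds the values and does not open a socket or call into the kernel. The numeric constants are quoted from memory of <linux/net_tstamp.h> and should be treated as assumptions to verify against the local kernel headers; the helper names are hypothetical.

```python
import struct

# Assumed values from <linux/net_tstamp.h>; verify against your kernel headers.
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_SYS_HARDWARE = 1 << 5
SOF_TIMESTAMPING_RAW_HARDWARE = 1 << 6

HWTSTAMP_TX_ON = 1        # assumed enum value
HWTSTAMP_FILTER_ALL = 1   # assumed enum value


def rx_timestamping_flags():
    """Bitmask requesting hardware receive timestamps, reported both raw
    and converted to system time (hypothetical helper)."""
    return (SOF_TIMESTAMPING_RX_HARDWARE
            | SOF_TIMESTAMPING_SYS_HARDWARE
            | SOF_TIMESTAMPING_RAW_HARDWARE)


def pack_hwtstamp_config(tx_type, rx_filter, flags=0):
    """Serialize struct hwtstamp_config { int flags; int tx_type; int rx_filter; }
    in native byte order, as it would be handed to the SIOCSHWTSTAMP ioctl."""
    return struct.pack("iii", flags, tx_type, rx_filter)


flags = rx_timestamping_flags()
config = pack_hwtstamp_config(HWTSTAMP_TX_ON, HWTSTAMP_FILTER_ALL)
```

In a real program the bitmask would be passed as the option value of SO_TIMESTAMPING and the packed structure would be referenced from the interface request of the ioctl; both steps require sufficient privileges and a network device with timestamping support.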

1.2.4. Hardware Configuration 

Two server types with real-time capabilities are designed for the
Soundjack server network: an AVB proxy server and an AVB processing
server. Both server types run on an x86_64 architecture with eight
physical cores and are equipped with an Intel I210 network interface
card [18]. An open source driver stack that is required to compile the
kernel module (igb_avb.ko) with AVB support is available on
GitHub [19]. The gPTP daemon used in this setup is also provided by
this repository. All AVB servers of both types are registered for a
media clock stream, which is supplied by an XMOS development board
manufactured by Atterotech [20],

2. IEEE 1722 AVTP MEDIA CLOCK SYNCHRONIZATION
CONCEPT FOR THE JACK AUDIO CONNECTION KIT

The signal processing concept is designed for a completely digital
signal chain, i.e. neither analog-to-digital (ADC) nor digital-to-analog
converters (DAC) are present. Without the local clock of an ADC
the processing server would have no media clock source to synchronize
to. Consequently, it is not possible to adjust the local clock
to match the gPTP grandmaster clock. With a media clock stream
as clock source, no additional hardware besides the network interface
is required. The media clock stream maintains a constant media
clock originating from a gPTP derived word clock of the ADC on the
XMOS development board. The ADC of the XMOS development
board is running at a sampling rate of 48 kHz and is configured as an
AVB talker. It automatically acknowledges any connection request
of a listener, without the use of IEEE 1722.1 ACMP. The different
clock source concepts are explained in detail in [16].


2.1. Packet Rate and Padded AVTP Packets 


The transmission interval of 125 µs, which is defined by SR class A,
remains constant for higher sampling rates as well. Instead of sending
packets in a shorter interval, the number of samples per packet is
adjusted. For a sampling rate of 48 kHz six samples per audio channel
are written to an AVTP packet (12 and 24 samples for 96 kHz and
192 kHz respectively):

$$125\,\mu\text{s} = \frac{6\ \text{samples}}{48\,\text{kHz}} \;\Rightarrow\; 8\ \text{packets per millisecond} \qquad (1)$$
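The relation in equation (1) can be checked with a few lines of integer arithmetic. This is only an illustrative sketch; the function and constant names are hypothetical and not taken from the backend source.

```python
CLASS_A_INTERVAL_US = 125  # transmission interval of SR class A in microseconds


def samples_per_packet(sample_rate_hz):
    """Samples per audio channel carried by one AVTP packet at a fixed 125 us interval."""
    samples, remainder = divmod(sample_rate_hz * CLASS_A_INTERVAL_US, 1_000_000)
    if remainder:
        # e.g. 44.1 kHz does not yield a whole number of samples per interval
        raise ValueError("sample rate does not fit the 125 us interval evenly")
    return samples


def packets_per_millisecond():
    """Packet rate that follows from the 125 us interval."""
    return 1000 // CLASS_A_INTERVAL_US
```

For 48 kHz, 96 kHz and 192 kHz this yields 6, 12 and 24 samples per packet, while the packet rate stays at 8 packets per millisecond.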

This way the transmission interval can maintain the media clock of
the talker for the listener to recover. Figure 1 shows the packet rate
of 8 packets per millisecond of the media clock stream originating
from the XMOS development board. Figure 2 shows the probability
distribution of the transmission intervals of the AVTP packets of the
media clock stream, measured on the processing server with hardware
packet arrival timestamps. The calculated mean value of 124997 ns and
standard deviation of 309.35 ns closely match the transmission
interval of 125 µs defined for an SRP class A domain.

In section 3 we will evaluate the three JACK period sizes of 32,
64 and 128 samples. Since six (samples per AVTP packet) is not an
integer divisor of 32, 64 or 128, the last packet of a JACK period is
only partially filled. The resulting number of packets per JACK period
is calculated in equation (2):

$$\left\lceil \frac{N\ \text{samples per JACK period}}{6\ \text{samples per AVTP packet}} \right\rceil = k\ \text{packets per JACK period} \qquad (2)$$

Samples   AVTP Packets
32        ⌈32/6⌉ = ⌈5 + 1/3⌉ = 6
64        ⌈64/6⌉ = ⌈10 + 2/3⌉ = 11
128       ⌈128/6⌉ = ⌈21 + 1/3⌉ = 22

Table 1: Samples and packets per JACK period

Figure 1: Packet rate of the IEEE 1722 AVTP media clock stream
originating from the XMOS talker. The MRP client of the
JACK media clock backend has established the
connection to the XMOS talker after 12 seconds. The
figure is enhanced and clipped at 60 seconds to show the
anomalies (packet rates of 7 and 9 packets per
millisecond) between around 30 and 50 seconds.

This means that for 32 samples per period every 6th AVTP packet
carries only a fraction of the six samples, in this case 1/3, i.e. two
samples, and the remaining four samples are padded with zeros. For
64 samples every 11th packet carries four samples and for 128 samples
every 22nd packet carries two samples; in both cases the rest is
padded with zeros.

Figure 2: Probability distribution function of the IEEE 1722 AVTP
media clock stream originating from the XMOS talker
(x-axis: transmission interval [ns]).
The measurement shows the network hardware receive
timestamps on the server side.
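The packet counts of Table 1 and the padding of the last packet of each JACK period can be reproduced with the following sketch. It assumes six samples per AVTP packet (48 kHz) as stated above; the function names are hypothetical.

```python
SAMPLES_PER_PACKET = 6  # 48 kHz at a 125 us transmission interval


def packets_per_period(period_size):
    """Equation (2): ceiling of period_size / 6, using integer arithmetic."""
    return -(-period_size // SAMPLES_PER_PACKET)


def last_packet_layout(period_size):
    """Return (valid_samples, zero_padded_samples) of the last packet in a period."""
    valid = period_size % SAMPLES_PER_PACKET or SAMPLES_PER_PACKET
    return valid, SAMPLES_PER_PACKET - valid


for period in (32, 64, 128):
    valid, padded = last_packet_layout(period)
    print(period, packets_per_period(period), valid, padded)
```

A period size that is a multiple of six would need no padding at all, which the `or SAMPLES_PER_PACKET` fallback accounts for.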

2.2. AVB Listener as JACK Media Clock Backend

The media clock listener is the same as in the AVB server
implementation [7] and is integrated by a C++ wrapper that was inspired
by Netjack [21], i.e. only the Read() (and Write(), which is
required for proper operation) member functions are used to advance
the JACK server according to the configured sample rate. As an
additional configuration, the JACK AVTP backend is required to run
a dummy stereo channel setup, because JACK clients could not be
activated otherwise.

int init_1722_driver(
    ieee1722_avtp_driver_state_t *ieee1722mc,
    const char* name,
    char* stream_id,
    char* destination_mac,
    int sample_rate,
    int period_size,
    int num_periods
)

Called with the appropriate arguments, the initialization procedure
starts an MRP thread, which takes care of the resource reservations
for the media clock listener. After the listener has established
the path to the media clock talker and the JACK server has started,
the backend's Read() member function calls the wrapped procedure:

uint64_t mediaclock_listener_wait_recv_ts(
    FILE* filepointer,
    ieee1722_avtp_driver_state_t **ieee1722mc,
    struct sockaddr_in **si_other_avb,
    struct pollfd **avtp_transport_socket_fds,
    int packet_num
)

This procedure is blocking until an AVTP media clock packet
arrives. The struct pollfd was used to keep blocking and
non-blocking procedure signatures consistent, since the AVB server's
main process also uses a media clock listener.

Figure 3: Different kernel and userspace layers involved in the
JACK media clock backend. The socket is filtered with a
Berkeley Packet Filter (BPF) for the correct destination
MAC address, Ethernet type and IEEE 1722 message type
of the media clock stream packets. The stream ID is
filtered after an AVTP packet is received in userspace.

The raw Ethernet socket that is used to receive the media clock
stream has the socket option SO_TIMESTAMPING set to:

ts_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
ts_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
ts_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;

The network device driver is configured to timestamp any incoming
packet with a struct hwtstamp_config set to:

hwconfig.rx_filter = HWTSTAMP_FILTER_ALL; 
hwconfig.tx_type = HWTSTAMP_TX_ON; 

Experience has shown that HWTSTAMP_TX_ON has to be switched
on for the reception of the media clock stream packets, even though
the socket is not used for transmission, because the gPTP system
service is otherwise affected and loses its synchronization to the PTP
master.

Considering the following code listing, after the received packet
was copied to the userspace buffer struct msghdr msg with the
recvmsg() system call, the ancillary data in struct msghdr
msg is accessed in line 8. Initially, the macro CMSG_FIRSTHDR
returns a pointer to the first header of the ancillary data, which is
stored in struct cmsghdr *cmsg. As long as there is ancillary data
available, the while-loop in line 9 iterates over the ancillary data of
the received message. When an SO_TIMESTAMPING field is
encountered, the pointers to the hardware timestamp and to the
hardware timestamp converted to system time are stored. The packet
arrival time in nanoseconds is converted from struct timespec
to a 64-bit unsigned integer and stored in the variable
pkt_arrival_ts_ns in line 17.

The current transmission interval of the packet is calculated after
the while-loop in line 25: the timestamp last_pkt_ts_ns of the
last packet is subtracted from the timestamp pkt_arrival_ts_ns
of the current packet. In line 26 the current timestamp is stored as
the last timestamp for the next packet.

The variable pkt_num is an argument of the procedure and is
supplied by the driver backend; it indicates the current packet number
in the JACK period. When pkt_num matches the calculated packet
numbers from table 1, a zero padded packet is sent.

If the pkt_num counter reaches the 6th, 11th or 22nd iteration,
adj_pkt_ts_ns is calculated in line 30 to precisely adjust the
JACK period. The remainder (modulus) of the samples per JACK period
divided by the six samples per channel per AVTP packet is divided by
the sample rate and then scaled to nanoseconds in 64-bit unsigned
integer representation. This calculation accounts for the padded AVTP
packets calculated in table 1. The procedure returns
adj_pkt_ts_ns to the backend driver, which can adjust the JACK
period accordingly.


 1  struct msghdr msg;
 2  struct cmsghdr *cmsg;
 3  uint64_t pkt_arrival_ts_ns = 0, adj_pkt_ts_ns = 0;
 4  uint64_t current_tx_int_ns = 0, last_pkt_ts_ns = 0; /* last_pkt_ts_ns must persist between packets */
 5  /* -8<- */
 6  recvmsg(..., &msg, ...);
 7  /* -8<- */
 8  cmsg = CMSG_FIRSTHDR(&msg);
 9  while (cmsg != NULL) {
10      if (cmsg->cmsg_level == SOL_SOCKET
11          && cmsg->cmsg_type == SO_TIMESTAMPING) {
12          struct timespec *ts_dev, *ts_sys;
13          /* second entry: hardware timestamp converted to system time */
14          ts_sys = ((struct timespec *) CMSG_DATA(cmsg)) + 1;
15          ts_dev = ts_sys + 1;    /* third entry: raw hardware timestamp */
16
17          pkt_arrival_ts_ns = (uint64_t) ts_dev->tv_sec
18                              * 1000000000ULL
19                              + (uint64_t) ts_dev->tv_nsec;
20          break;
21      }
22      cmsg = CMSG_NXTHDR(&msg, cmsg);
23  }
24
25  current_tx_int_ns = pkt_arrival_ts_ns - last_pkt_ts_ns;
26  last_pkt_ts_ns = pkt_arrival_ts_ns;
27
28  /* last packet of the JACK period: multiply before dividing to avoid integer truncation */
29  if (pkt_num == (*ieee1722mc)->num_pkts - 1) {
30      adj_pkt_ts_ns = (uint64_t) ((*ieee1722mc)->psize % 6)
31                      * 1000000000ULL
32                      / (uint64_t) (*ieee1722mc)->srate;
33  }
34
35  return current_tx_int_ns - adj_pkt_ts_ns;
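The arithmetic of the listing can be replayed outside the driver. The following sketch mirrors the nanosecond conversion and the adjustment term described above; all function names are hypothetical, and the last line illustrates why the multiplication has to come before the division when only integers are involved.

```python
NS_PER_S = 1_000_000_000


def timespec_to_ns(tv_sec, tv_nsec):
    """Convert the two fields of a struct timespec to nanoseconds."""
    return tv_sec * NS_PER_S + tv_nsec


def transmission_interval_ns(current_ts_ns, last_ts_ns):
    """Interval between two successive packet arrival timestamps."""
    return current_ts_ns - last_ts_ns


def padded_packet_adjustment_ns(period_size, sample_rate, samples_per_packet=6):
    """(period_size mod 6) / sample_rate, scaled to whole nanoseconds."""
    return (period_size % samples_per_packet) * NS_PER_S // sample_rate


# Dividing first truncates to zero in integer arithmetic:
naive = ((32 % 6) // 48000) * NS_PER_S
```

For 32 and 128 samples per period at 48 kHz the adjustment evaluates to 41666 ns (two samples), for 64 samples to 83333 ns (four samples).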


3. EVALUATION 

The quality of the synchronization to the media clock stream may be
analyzed in terms of the variation between the points in time when
a JACK client is triggered and when a media clock stream packet is
received. In essence, we observe how many media clock stream packets
are received between two successive calls of the JACK backend
to the client's process callback function. The AVTP backend is based
on counting the media clock stream packets, thus it is implicitly
synchronized to the media clock stream source. The ALSA backend is
not implicitly synchronized to the media clock stream source, which
is the reason for the development of the AVTP backend. A
synchronization would also be possible, since the media clock source
and the servers are synchronized to the gPTP network clock. The
media clock source of the XMOS development board drives its audio
codec with a phase locked loop that locks onto the gPTP network
clock. The local sample clock of an audio device connected to a
server would also require a phase locked loop that is fed into the
audio device, or a continuous calculation and adjustment between the
network time and the audio time.

The “simple_client.c” example from the JACK source tree has
been modified to make a system call to the system clock, which
is synchronized to gPTP, with CLOCK_REALTIME every time the
JACK process callback is triggered. The measured timestamps are
written to a file in the JACK shutdown callback function.
Simultaneously, the JACK AVTP backend writes the timestamps from the
ancillary data to a file as soon as JACK is shut down. In order to be
able to compare the client activation times of the ALSA backend with
those of the AVTP backend, a common time source is required.
Instead of a local audio time that is adjusted to gPTP, we use the media
clock stream as common time source. The JACK server is launched
twice for this reason, one instance running with the ALSA backend
and the measurement client, and a second instance running only with
the AVTP backend to measure the media clock stream. The server
was connected to a Focusrite Solo Gen2 USB audio interface [22]
when the ALSA backend was measured.

The measurements were conducted with 32, 64 and 128 samples
per JACK period with a sample rate of 48 kHz over a duration
of five minutes, producing between approximately 10^5 and 5 · 10^5
client activations, depending on the period size. Furthermore, the AVTP
backend was measured with two different configurations. In the first
configuration, the differences of the successive packet arrival times
are accumulated, as explained in subsection 2.2 (AVTP Adjust).
In the second configuration, a constant difference of 125,000
(nanoseconds) is added each time a media clock stream packet
arrives (AVTP Const). No buffer over- or underrun occurred in any of
the JACK backend configurations. The results of the measurements
are shown in table 2.
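The stated order of magnitude of client activations follows directly from the sample rate, the period size and the measurement duration. The sketch below assumes a duration of exactly 300 seconds, which the text only gives as "five minutes"; the function name is hypothetical.

```python
def expected_activations(sample_rate, period_size, duration_s):
    """Number of JACK process callbacks expected within the measurement duration."""
    return sample_rate * duration_s // period_size


for period in (32, 64, 128):
    print(period, expected_activations(48000, period, 300))
```

This gives 450000, 225000 and 112500 activations for 32, 64 and 128 samples per period, i.e. between roughly 10^5 and 5 · 10^5.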


4. DISCUSSION 

Table 2 confirms the primary motivation for the development of the
JACK AVTP backend: the ALSA measurements for each period size
show a broad distribution of client activation times, which is
further emphasized by their average and standard deviation. The
expected value is not met in any configuration and the deviation is
significantly higher than with AVTP. This results in a JACK client and
a backend which are not synchronized to the media clock. The
required media clock stream packets per JACK period from table 1 are
hardly met.

Media Clock |                        JACK Client Activation Count
Stream      |       32 Samples        |       64 Samples        |       128 Samples
Packet      |    AVTP    AVTP         |    AVTP    AVTP         |    AVTP    AVTP
Count       |  Adjust   Const    ALSA |  Adjust   Const    ALSA |  Adjust   Const    ALSA
------------+-------------------------+-------------------------+-------------------------
1           |   14099       0   15012 |       0    5936       0 |       0       0       0
2           |       0       0      19 |       0       0       0 |       0       0       0
3           |       1       0   32242 |       0       0       0 |       0       0       0
4           |       3       1  119103 |       0       0       0 |       0       0       0
5           |   16353   15328    7022 |       0       0       0 |       0       0       0
6           |  437342  406581  266913 |       0       0    5437 |       0       0       0
7           |   16416   15392   34865 |       0       0   61360 |       0       0       0
8           |       4       1      18 |       0       0   17693 |       0       0       0
9           |       1       0       0 |       2       1     282 |       0       0       0
10          |       0       0       0 |    8757    3275       9 |       0       0       0
11          |       0       0       0 |  204408  211261    2210 |       0       0       0
12          |       0       1       0 |    8817    3337   95166 |       1       0       0
13          |       0       0       0 |       2       0   70530 |       0       0       9
14          |       0       0       0 |       1       0    1634 |       0       0    2332
15          |       0       0       0 |       0       0       0 |       0       0   36583
16          |       0       0       0 |       0       0       0 |       0       0    2562
17          |       0       0       0 |       0       0       0 |       0       0       7
18          |       0       0       0 |       0       0       0 |       0       0       0
19          |       0       0       0 |       0       0       0 |       1       0       0
20          |       0       0       0 |       0       0       0 |       0       0       0
21          |       0       0       0 |       0       0       0 |    2824    3901       0
22          |       0       0       0 |       0       0       0 |  104961  107485      88
23          |       0       0       0 |       0       0       0 |    2891    3969   10814
24          |       0       0       0 |       0       0       0 |       0       0   61739
25          |       0       0       0 |       0       0       0 |       0       0   10292
26          |       0       0       0 |       0       0       0 |       0       0      54
27-35       |       0       0       0 |       0       0       0 |       0       0       0
------------+-------------------------+-------------------------+-------------------------
Average     |     5.8     6.0     5.2 |    11.0    10.7    10.6 |    21.6    22.0    21.3
Standard    |                         |                         |
Deviation   |    0.88    0.26    1.35 |    0.28    1.61    2.54 |    2.71    0.26    3.79

Table 2: JACK client activation count in respect to media clock stream

Comparing the two AVTP backend configurations for each period
size shows, except for some outliers that account for less
than 1% of the activation counts, that both configurations provide
an equivalent solution. The averages and standard deviations of all
period size configurations imply a synchronized JACK client and
backend. The required media clock stream packets per JACK period
from table 1 are mostly met, with slight deviations.
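The averages and standard deviations in table 2 are statistics of a histogram (activations per packet count). As an illustration, the following sketch evaluates one column, the 64-sample AVTP Const measurement, with its non-zero counts copied from the table. Using the population standard deviation is an assumption, since the text does not state which estimator was used; the function name is hypothetical.

```python
import math


def histogram_stats(counts):
    """Mean and population standard deviation of a {value: occurrences} histogram."""
    total = sum(counts.values())
    mean = sum(value * n for value, n in counts.items()) / total
    variance = sum(n * (value - mean) ** 2 for value, n in counts.items()) / total
    return mean, math.sqrt(variance)


# Non-zero rows of the "64 Samples, AVTP Const" column of table 2.
avtp_const_64 = {1: 5936, 9: 1, 10: 3275, 11: 211261, 12: 3337}
mean, std = histogram_stats(avtp_const_64)
```

For this column the sketch yields a mean of about 10.74 and a standard deviation of about 1.62, close to the reported 10.7 and 1.61; the few activations at a packet count of 1 account for most of the deviation.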

5. CONCLUSIONS 

Inherently, the ALSA backend for JACK adds some drift to the signal
processing chain inside the Soundjack server network. Therefore, an
experimental IEEE 1722 AVTP media clock backend for JACK was
developed to overcome this problem. We could show that our solution
works and provides the desired synchronization, and that it is not
necessary to adjust the duration of each JACK period with nanosecond
accuracy.

Since the AVTP backend only receives AVTP packets, it is
theoretically possible to run the backend on any PTP enabled device,
even when no prioritized transmission queues are provided by the
hardware, as is the case with the Intel I217, for example.

6. FUTURE WORK 

Future work will focus on testing the Soundjack server network setup
in the real world, i.e. on the public Internet instead of the campus
network. This requires adopting IPv6 and evaluating the resulting
changes to the network tomography.

Furthermore, the AVB processing server network shall in the future
be migrated to function as a completely AVB capable JACK
backend, not just for media clock synchronization.

It will also be of interest to achieve a synchronization between
client and server via the public Internet. Mechanisms best suited for
this feature are already under investigation.

ABSTRACT

Tpf-tools are used to establish bi-directional, low-latency,
multichannel audio transmission between two or more geographically
distant locations. The tool set consists of a server part (the
tpf-server) and a client part (the tpf-client) and is heavily inspired
by the JackTrip utility. It is based on the same protocol. It
facilitates the handling of many concurrent audio transmissions in
setups with more than two endpoints. Also, it eliminates the
requirement of one endpoint having a public IP address or port
forwarding configuration.

1. INTRODUCTION 

The JackTrip [1] utility has proven to be a very useful and versatile
tool for our research into the so-called telematic performance format
(tpf): staged events (musical or of other kinds) that take place
simultaneously at two or more geographically distant concert venues.
For these concerts, the stage is designed to blend physically present
local performers with their remote counterparts, represented by means
of low-latency video (UltraGrid 1) and audio (JackTrip) transmission.

1.1. The obstacles of current IP networks 

We have successfully used the JackTrip utility in many of our
telematic concerts. The utility operates in two modes: client mode and
server mode. For an audio transmission to take place, one end runs it
in server mode listening for an inbound connection, while the other
end runs it as client, thus initiating the connection. This works well
so long as the client "sees" the IP address of the server. In today's
Internet, most computers touched by human beings are assigned an
IP address from a local area network (LAN) which is protected by
a NAT router 2. Public IP addresses are usually only assigned to
headless servers and - apparently - NAT routers, but not to devices
touched by humans. This topology divides the Internet into service
providers and consumers and reflects the predominant capitalist
ideology of today's Internet [2, Chapter 5]. At the same time, it
hinders our efforts to perform telematic concerts. Running JackTrip in
server mode at a concert venue requires a computer that has either a
public IP address assigned or the proper port forwarding configured on
the local network router. At venues where the performers are not the
owners or administrators of the local network, this often entails a
large administrative overhead and dealing with IT staff who may be more
concerned about security than artistic achievements.

1 Software for low-latency video transmission, http://www.ultragrid.cz/

2 NAT (network address translation) routers separate the LAN from the
Internet. This increases security, because local computers are invisible from
the Internet. It is also a way to deal with IPv4 address exhaustion, because
all devices of a local network share one public IP address for outbound
connections.


1.2. The complexity of many nodes 

Another complexity we have encountered is the planning and setup
of JackTrip connections when not two, but three or (for a test
situation) four venues are participating in an event. Two endpoints
require one link. Three endpoints require three links, while four
endpoints require six links. The number of links grows quadratically
with the number of endpoints. Events with more than two nodes
therefore require meticulous planning.
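The link counts quoted above correspond to a full mesh, in which every pair of endpoints needs one link, i.e. n(n-1)/2 links for n endpoints. A minimal sketch, with hypothetical function and venue names:

```python
from itertools import combinations


def links_needed(endpoints):
    """Number of point-to-point links in a full mesh of `endpoints` nodes."""
    return endpoints * (endpoints - 1) // 2


def enumerate_links(names):
    """List every pair of venues that needs its own link."""
    return list(combinations(names, 2))


pairs = enumerate_links(["A", "B", "C", "D"])
```

Four venues already require six links, and each further venue adds as many links as there are venues before it.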

1.3. Our motivation 

We are looking for ways to streamline our processes and improve 
our tools in order to be able to shift our focus away from technical 
to more artistic aspects. JackTrip is the tool of our choice, because it 
is multi-platform, open source, uses JACK 3 and thus integrates well 
with existing professional audio software (e.g. Ardour). However, 
we saw an opportunity in adding a higher layer on top of the strong 
basis JackTrip gives us. In our efforts, we have developed a tool set 
that addresses the obstacles we’ve been experiencing: 

• None of the endpoints need a public IP address. 

• The client manages the audio transmissions to many endpoints
and abstracts the complexity of such setups away, while
presenting a simple, yet comprehensive interface to the user.

In this paper we present our tool set consisting of the tpf-client 4 
(the software that is running on each participating endpoint) and the 
tpf-server 5 (the software that enables communication between the 
clients and coordinates audio transmissions). 

2. VARIOUS CONNECTION MODES 

2.1. Client connects to server (standard mode) 

The JackTrip utility is designed so that both ends are sending
similarly formatted UDP 6 packets. In server mode, it opens a listening
socket that waits for incoming connections. As soon as a packet
arrives, it starts sending packets to the sender address of the incoming
packets. In client mode, it immediately starts sending packets. The
transmission is established as soon as both ends are up and running.
This only works when the IP address of the server is visible to the
client.

3 JACK Audio Connection Kit, a sound server daemon for connecting audio
applications and sound cards, http://www.jackaudio.org/

4 The tpf-client is available at https://gitlab.zhdk.ch/TPF/tpf-client.

5 The tpf-server is available at https://gitlab.zhdk.ch/TPF/tpf-server.

6 User Datagram Protocol, a connectionless protocol based on the Internet
Protocol that operates on the Transport Layer (Layer 4) of the OSI model.
Applications with a strong focus on low latency often use it for transport.

2.2. Two clients connect to each other

A transmission can also be established when running both endpoints 
in client mode so long as both clients specify both bind port and peer 
port. The peer port of the first client matches the bind port of the 
second client, and vice versa. 

An example of a JackTrip setup with both instances running as 
client: 

$ jacktrip -c 192.168.0.12 --bindport 2000 --peerport 3000

$ jacktrip -c 192.168.0.11 --bindport 3000 --peerport 2000

This requires both ends to have an IP address visible to the other
party. If one or both endpoints are hidden by a NAT firewall, a
connection cannot be established. However, this setup shows that the
JackTrip design does not mandate one party to run as server.

2.3. Connection using a UDP proxy

The fact that a transmission can happen with two endpoints both
running in client mode is crucial for the next step: establishing a
transmission where none of the endpoints is assigned a public IP
address. Since we want both endpoints to run in client mode, we
need a third party that has been assigned a public IP address and thus
is visible to both endpoints, even when they are behind a firewall.
This third party acts as a proxy for both endpoints by relaying
packets from client A to client B and vice versa. This technique passes
most types of firewalls easily because the client initiates the
connection. It works transparently for both endpoints, as they do not
have to know their respective peer's IP address. They simply connect to
the UDP proxy. Since the JackTrip packet format is agnostic of the
underlying transport protocol, all connection-specific details are part
of the UDP header and the payload does not contain any reference
to the client address or port number. This allows the UDP proxy to
relay incoming datagrams as is, without inspecting or changing the
payload.

Figure 1: Subscription-based UDP proxy.

3. SUBSCRIPTION-BASED UDP PROXY

The simplest variant of a UDP proxy knows exactly two endpoint
addresses and relays packets between them. However, this design
mandates that each parallel transmission uses an instance of the UDP
proxy, each listening on a dedicated port. The purpose of the
subscription-based UDP proxy is to allow many parallel transmissions
on the same port. To know which endpoints belong to a certain
transmission, the endpoints send a so-called token that is unique per
transmission. If two clients send the same token, a transmission between
those endpoints is established. This design allows an arbitrary
number of transmissions to run on the same port, and each transmission
is protected from intentional or unintentional interference by the
token. Because of the requirement to send a token, the
subscription-based UDP proxy does not work with the traditional
JackTrip, at least not out-of-the-box 7. Also, both parties intending
to participate in a transmission must first agree on a common token
through a separate channel.

7 JackTrip could be wrapped into a script that first sends the token using
the same bind port before it starts JackTrip.

3.1. Implementation

The tpf-server presented here uses a Python 8 script as
subscription-based UDP proxy. It uses two dictionaries (dicts) that are
empty at start-up: a token dict and a link dict. The token dict stores
the token string and sender address when a token message is received.
The token message is a UDP packet containing a string like

_TOKEN XXXX

where XXXX is the token string, an arbitrary string of arbitrary length.
If a token message is received, its token string is looked up in the
token dict. If no entry is found, an entry is added to the token
dict with the token string as key and the sender address as value. If
another token message is received carrying the same token string but
from a different sender address, two entries are made in the link dict.
The first entry uses the address from the token dict as key and the
sender address of the last token message as value. The second
entry uses the same two addresses, but key and value are interchanged.
After creating the entries in the link dict, the respective entry in the
token dict is deleted, so that the same token may be used later by
another party.

Since the UDP protocol does not guarantee that packets reach
their destination, the client must keep sending token messages at a
low rate (i.e. one message per second). When the client receives a
packet for the first time, it stops sending token messages.
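The bookkeeping with the token dict and the link dict described in this section can be sketched as a pure function that decides where an incoming datagram should be relayed. This is not the tpf-server source, only an illustration of the described steps; the function name, the byte-string handling and the example addresses are assumptions, and the socket loop that would call it is omitted.

```python
TOKEN_PREFIX = b"_TOKEN "


def handle_datagram(payload, sender, tokens, links):
    """Update the token/link dicts and return the relay destination.

    Returns None when the datagram is a token message or when the sender
    is not (yet) linked to a peer.
    """
    if payload.startswith(TOKEN_PREFIX):
        token = payload[len(TOKEN_PREFIX):].strip()
        first = tokens.get(token)
        if first is None:
            # First subscriber: remember who announced this token.
            tokens[token] = sender
        elif first != sender:
            # Second subscriber: link both addresses in both directions.
            links[first] = sender
            links[sender] = first
            del tokens[token]  # the token may be reused later
        return None
    # Ordinary datagram: a single dict look-up yields the destination.
    return links.get(sender)


tokens, links = {}, {}
a = ("12.54.7.7", 30001)
b = ("98.65.4.4", 30005)
handle_datagram(b"_TOKEN abc", a, tokens, links)
handle_datagram(b"_TOKEN abc", b, tokens, links)
destination = handle_datagram(b"audio payload", a, tokens, links)
```

Storing each link in both directions is what makes the relay path a single look-up keyed by the sender address.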

3.2. Considerations 

Creating two entries per transmission in the link dict seems like a
waste of memory, but it allows for a very quick look-up to determine
the destination of an incoming packet. Keeping the latency low has
the highest priority in our use case.

Although Python, as an interpreted language, is not among the
fastest, it was the preferred choice for rapid prototyping and
experimenting. It turned out that the UDP proxy written in Python was
never the bottleneck in our performance tests and, although it causes
some CPU load under heavy traffic, it does not seem to add a
significant latency to the UDP transport. There has not yet been a
pressing need to rewrite the UDP proxy in a more performant way.

8 Python is an interpreted programming language supporting many
paradigms, https://www.python.org/

4. THE TPF SERVER

The complexity of a setup increases quickly with the number of
participating endpoints, as we showed before. We wanted software to
manage the complex part of handling many parallel audio
transmissions. The engineer should not have to deal with many terminal
windows for running many JackTrip instances and know what IP address
and port number each of their peers uses. Simplifying the involved
processes was the main motivation for defining a protocol [3] and
writing a server software implementing that protocol. It is worth
noting that this part is orthogonal to the problem of audio
transmission. The tpf-server is not involved in the transmission of any
audio data. Rather, it enables clients to know about each other and
lets them initiate audio transmissions. The communication between
server and clients uses TCP and runs on different ports.


sages whose OSC address starts with /s/tpf are handled by the 
tpf-server. The protocol is built on top of the protocol of the
netpd-server. The exact protocol specification is part of the tpf-server
package [3]. Since the protocol is based on OSC, it is agnostic of any
software framework or programming language. It could be
implemented in any language where libraries exist to deal with network
sockets and the OSC protocol. It was implemented in Pure Data,
because it uses parts already written in Pure Data. The tpf-server
keeps track of the connected clients and coordinates a few common
parameters that the endpoints must agree on before they are able to
establish an audio transmission. It manages a few data containers
and notifies clients about updates when data is changed. The
tpf-server sends current data to the clients upon their request, while
it is the duty of the clients to request data if they receive an update
notification from the server. The data containers include:


4.1. Based on netpd-server and OSC 

In order to reduce development effort, the design is based on
existing software - the netpd-server 9 - that was extended to
implement the tpf-server presented here. The netpd-server is a relay
for OSC messages and was developed for the netpd [4] project, a
framework based on Pure Data (Pd) [5] that allows geographically
remote clients to make electronic music together in real time by
synchronizing instrument states. The netpd framework uses OSC [6] for
communication; OSC messages are encapsulated by SLIP [7] and
transported over TCP. The OSC 1.1 specification [8] proposes SLIP to
delimit OSC messages when they are transported by stream-oriented
protocols such as TCP. While many OSC applications use UDP for
transport for simplicity and speed, data integrity and correct order
are crucial for the netpd framework. For the tpf-server as well,
whose purpose is to coordinate clients and allow them to share data,
and which is not involved in the audio transmission directly,
reliability trumps speed. TCP has a notion of connection, so for a
server using TCP, there is no ambiguity in knowing when a client
joins or leaves. With UDP it is much harder to clearly determine a
client’s state (e.g. joined or left).
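
The framing of OSC messages on a TCP stream can be sketched in a few lines of Python. The byte values are those defined for SLIP in RFC 1055; the function names are hypothetical, the use of an END byte both before and after each packet follows the double-END variant suggested for OSC 1.1 as we understand it, and the sketch silently discards an incomplete trailing packet. It is an illustration, not the netpd implementation.

```python
from typing import List

END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD  # SLIP constants (RFC 1055)


def slip_encode(packet: bytes) -> bytes:
    """Wrap one packet (e.g. an OSC message) into a SLIP frame."""
    out = bytearray([END])
    for byte in packet:
        if byte == END:
            out += bytes([ESC, ESC_END])   # escape a literal END byte
        elif byte == ESC:
            out += bytes([ESC, ESC_ESC])   # escape a literal ESC byte
        else:
            out.append(byte)
    out.append(END)
    return bytes(out)


def slip_decode(stream: bytes) -> List[bytes]:
    """Split a received byte stream into the packets it contains."""
    packets: List[bytes] = []
    current = bytearray()
    escaped = False
    for byte in stream:
        if escaped:
            if byte == ESC_END:
                current.append(END)
            elif byte == ESC_ESC:
                current.append(ESC)
            else:
                current.append(byte)       # protocol violation, passed through
            escaped = False
        elif byte == ESC:
            escaped = True
        elif byte == END:
            if current:                    # empty frames between END bytes are skipped
                packets.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    return packets
```

Because TCP delivers a continuous byte stream without message boundaries, the receiver relies on the END bytes alone to recover the individual OSC messages, regardless of how the stream was split into TCP segments.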

4.2. netpd-server 

The netpd-server defines rules about how incoming OSC messages 
are forwarded to the connected clients. This allows clients to send 
messages to specific peer clients, broadcast messages to all clients, 
or send messages to the server itself. The netpd-server forwards OSC 
messages according to the first element of the OSC path. The set of 
supported values for this field is listed here: 


field   forwarding action
b       message is broadcast to all connected clients
s       message is intended for the server itself (not forwarded)
<int>   message is forwarded to the client with ID <int>

Table 1: List of valid receivers
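
The forwarding rule can be expressed as a small Python function. This is a sketch under assumptions: the function name `route`, the example paths and the decision to raise an error for unknown receivers are ours and are not taken from the netpd-server.

```python
from typing import List, Optional, Set


def route(osc_address: str, connected: Set[int]) -> Optional[List[int]]:
    """Return the IDs of the receiving clients, or None if the
    message is intended for the server itself."""
    parts = osc_address.split("/")
    if len(parts) < 2 or parts[0] != "":
        raise ValueError(f"not an OSC address: {osc_address!r}")
    field = parts[1]                      # first element of the OSC path
    if field == "b":
        return sorted(connected)          # broadcast to all connected clients
    if field == "s":
        return None                       # handled by the server, not forwarded
    if field.isdecimal() and int(field) in connected:
        return [int(field)]               # forwarded to one specific client
    raise ValueError(f"no valid receiver in {osc_address!r}")
```

For example, with clients 0, 1 and 2 connected, a message addressed to a path starting with /2/ is delivered to client 2 only, while a path starting with /s/ never leaves the server.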


4.3. The tpf-server internals 

The tpf-server loads the netpd-server as an abstraction [9]. It
reserves the OSC name space /s/tpf, which means that all received
messages whose OSC address starts with /s/tpf are handled by the
tpf-server.

9 The netpd-server is part of the netpd framework developed by Roman 
Haefeli. The code is hosted at https://github.com/reduzent/netpd-server 


4.3.1. Client ID And Name 

When a client connects, the tpf-server assigns it a unique client ID
(unique in the scope of the session). This ID, usually a small
integer, is used to identify each client. The same ID is also used
to send an OSC message to a specific client by putting it into the
first field of the OSC path. After establishing the connection to the
server, the client registers a name (e.g. given name or location).
This allows clients to display the list of connected peers in a more
human-friendly way (see Client List).

4.3.2. Parameter List 

The client with the smallest ID, usually the one that connects first
to the server, is given a special role: it has the authority to set
or change a set of parameters that all clients are mandated to use
for the current session - samplerate, blocksize, bit resolution.
Those parameters are distributed to all clients, and the clients
either adjust their settings or report an error when a mismatch
occurs. The parameter list is not a hard-coded set. Instead, it is
fully defined by the clients.

4.3.3. Client List 

The tpf-server keeps a list of all connected clients with their ID,
name, IP address and role. Whenever a client connects or disconnects,
the tpf-server broadcasts an update of this list to all clients. It
is therefore crucial that clients terminate their connection
properly; otherwise they keep appearing in the client list until the
connection is considered terminated. This period depends on the
operating system.

4.3.4. Link List 

In a full mesh network, each node is linked to every other node. If n
is the number of nodes, the number of links (l) is:

$$ l = \frac{n(n-1)}{2} \qquad (1) $$

The tpf-server assigns each pair of clients a link ID, so each link
ID associates two clients. The tpf-server sends each client a list of
its peers’ client IDs along with the corresponding link IDs. Clients
use the link ID to establish the audio transmission to a specific
peer. Early versions used one server port per transmission, and the
tpf-client used the link ID as the port offset parameter for running
JackTrip. In the current version, the link ID is used to generate a
token string. Two clients using the same ID and thus the same token
string are linked by the subscription-based UDP proxy.
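
The following Python sketch illustrates the link list for a full mesh. The numbering scheme (link IDs counted up from 0 over the sorted client pairs) and the function names are assumptions for illustration; the source does not specify how the tpf-server numbers its links.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def link_table(client_ids: Iterable[int]) -> Dict[int, Tuple[int, int]]:
    """Assign one link ID to every unordered pair of clients (full mesh)."""
    pairs = combinations(sorted(client_ids), 2)
    return {link_id: pair for link_id, pair in enumerate(pairs)}


def peers_of(client_id: int, table: Dict[int, Tuple[int, int]]) -> Dict[int, int]:
    """Per-client view of the link list: peer client ID -> link ID."""
    view = {}
    for link_id, (a, b) in table.items():
        if client_id == a:
            view[b] = link_id
        elif client_id == b:
            view[a] = link_id
    return view
```

For four clients the table holds 4 * 3 / 2 = 6 links, and each client receives a view with three entries, one per peer. Both ends of a link see the same link ID, which is what allows them to derive the same token.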
When every transmission was using its dedicated UDP port, it
seemed appropriate to let the server, as a central authority, assign
link IDs to avoid collisions, but also to ensure that only IDs
corresponding to an active UDP proxy would be assigned. With the
subscription-based UDP proxy, this coordination task became moot,
as clients could also negotiate a token by peer-to-peer
communication without involving the server. Future versions of the
tpf-server might remove the link list.

5. THE TPF-CLIENT 

5.1. Written in Pure Data 

The tpf-client is implemented in Pure Data, so it can be built on top
of an already existing framework. For communication with the server,
parts of the netpd client have been reused. Designed as a real-time
audio programming language, Pure Data already covers many aspects of
dealing with low-latency audio. Furthermore, part of the Pd
ecosystem is a vibrant community that has been contributing many
libraries extending the functionality of the software. Notably,
there are so-called externals for parsing and formatting OSC
messages (osc) and for accessing network sockets (iemnet). Pure Data
has native JACK support built in and runs on a variety of platforms.

5.2. Implementation 

The purpose of the tpf-client is to manage audio transmissions to one
or many peers joining the same session. It is the implementation of
the client side of the tpf protocol. First drafts only implemented the
management aspects in order to get the necessary information for
starting the original JackTrip utility with the appropriate
command-line arguments, so the audio transmission part was left
completely to JackTrip. It was later decided to also re-implement the
JackTrip utility as an abstraction.

5.2.1. Rewrite of JackTrip as Pd abstraction 

Implementing the audio transmission part in Pure Data has some
advantages:

• The lack of a stable and feature-complete Pd external for
running system commands makes it hard to consistently control
many JackTrip instances from Pd. JackTrip reimplemented as
a Pd abstraction is easier to control and interface with.

• An implementation of the JackTrip protocol in Pd allows it to
be extended, if necessary. A small addition to the JackTrip
functionality - the subscription by sending a token message -
was necessary to support the subscription-based UDP proxy.

• Although able to create many JackTrip connections, the
tpf-client appears as one JACK client, which somewhat simplifies
the process of drawing connections in the connections dialog
of QjackCtl.

• Since the audio signals travel through Pd, some signal
processing could be applied. The current implementation doesn’t
apply any processing, though.

• Since the audio signals travel through Pd, signal level
monitoring can be used and graphically represented in the client
user interface.

• The signal path can be used to measure the round-trip time of
the audio signal with the built-in latency meter.


5.3. User interface 



Figure 2: The tpf-client user interface. 

The user interface displays a few configuration parameters that 
are settable before the connection to the server is initiated: 

• name 

• hostname (or IP address) of the tpf-server 

• blocksize (of the JackTrip packets) 

• number of channels (outgoing) 

• queue buffer size 

The samplerate and bit resolution cannot be changed in the client.
The bit resolution is hard-coded to 16 bit. The samplerate is
mandated by the JACK server and is inherited by Pd. After the
connection is established, those configuration parameters become
locked and cannot be changed until the session ends.

The client registers its name and either uploads the audio
parameters (samplerate, blocksize, bit resolution) to the server
or matches them against the mandated parameters, if another client
has already configured those parameters. If there is a mismatch
between configured and mandated parameters, the client either reports
an error (mismatch of samplerate or bit resolution) or silently
adjusts the parameter (mismatch of blocksize). It is worth noting
that the blocksize configured in the tpf-client is decoupled from the
blocksize used by the JACK server. This allows clients to run JACK
with differing blocksizes. After the name has been registered and the
audio parameters have been matched successfully, the connection
button (top left) turns blue to indicate that the client is ready for
audio transmissions.
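
The matching rule described above can be sketched as follows. The dictionary keys, the exception type and the function name are hypothetical; the actual tpf-client is a Pd patch and exchanges these values as OSC messages.

```python
class ParameterMismatch(Exception):
    """Raised when a parameter that cannot be adjusted differs."""


def reconcile(local: dict, mandated: dict) -> dict:
    """Match local audio parameters against the mandated ones.

    A mismatch of samplerate or bit resolution is an error; a mismatch
    of blocksize is resolved silently by adopting the mandated value.
    """
    for key in ("samplerate", "bit_resolution"):
        if local[key] != mandated[key]:
            raise ParameterMismatch(
                f"{key}: local {local[key]} != mandated {mandated[key]}"
            )
    adjusted = dict(local)                       # leave the caller's dict untouched
    adjusted["blocksize"] = mandated["blocksize"]
    return adjusted
```

The asymmetry reflects the text: the samplerate is dictated by the JACK server and the bit resolution is hard-coded, so the client cannot change them, whereas the blocksize of the transmission is independent of JACK and can simply be adopted.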

5.4. Managing transmissions 

Peer clients are each listed in a separate row in the client
interface. Audio transmissions are not started automatically, but are
initiated by a user on either side by clicking the left-most button
in the row. The button in the respective row on the peer’s client
starts flashing. Only when the other end confirms by clicking on the
flashing button is the audio transmission started. The number of
received channels is represented by the number of squares turning
from grey to black in the respective row. Depending on the signal
level of each channel, the square changes color from black (silence)
to bright green (full amplitude). The number in each square
corresponds to the port number of the tpf-client in the QjackCtl
connection dialog.

5.5. Transmission monitoring 

During an audio transmission, three types of glitches are counted and 
displayed in the respective row: 
DROP number of dropped packets. Late packets that miss their
time frame to be played back are also considered dropped.

GLITCH number of audible audio glitches. Often, many packets 
are dropped in a row, resulting in one audible glitch. Thus, the 
number of audible artefacts is always smaller than the number 
of dropped packets. 

OOO number of packets received out of order. If an out-of-order 
packet misses its time frame, it is dropped. Otherwise it is 
played back in correct order. 

All counters are reset to zero when the audio transmission is 
restarted. Although those counters are not of much use during a real 
concert (not much can be done about bad statistics), they might help 
compare the quality of different network links, when testing setups 
or internet providers. 
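
How the three counters relate can be shown with a simplified Python model. It assumes that a run of consecutive dropped packets is heard as one glitch and that a packet is out of order when its sequence number is lower than one already received; how the tpf-client actually detects these events is not described here, so both function names and both rules are assumptions.

```python
from typing import Iterable, Tuple


def count_drops_and_glitches(played: Iterable[bool]) -> Tuple[int, int]:
    """played[i] is True if packet i was played back in time.

    Returns (DROP, GLITCH): every missing packet counts as a drop,
    every run of consecutive drops counts as one glitch.
    """
    drops = glitches = 0
    previous_ok = True
    for ok in played:
        if not ok:
            drops += 1
            if previous_ok:
                glitches += 1      # first packet of a gap starts a new glitch
        previous_ok = ok
    return drops, glitches


def count_out_of_order(arrivals: Iterable[int]) -> int:
    """Count packets that arrive after a packet with a higher sequence number."""
    ooo, highest = 0, -1
    for seq in arrivals:
        if seq < highest:
            ooo += 1
        else:
            highest = seq
    return ooo
```

In this model, a stream in which packets 1 to 3 and packet 6 are missing yields four drops but only two glitches.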

5.6. Message and chat window 

Besides the main window, the tpf-client’s interface has a message
window, where info, warning and error messages are displayed. There
is also a built-in chat in the chat window, since a channel of
communication not involving audio is often desired.

5.7. Built-in latency measurement tool 

To measure the overall round-trip time of the audio signal, both
endpoints need to configure the audio path accordingly. The method is
robust enough to allow the signal to be played back by a speaker
and recorded with a microphone, even in a mildly noisy environment.
The signal path of a full round-trip measurement is shown in
Figure 3.



Figure 3: Signal path of latency measurement. 


5.8. Adding artificial latency 

The tpf-client allows an artificial audio delay to be added to each
audio transmission. By adjusting the delay, it is possible to target
a specific total round-trip time. Reasons for latency adjustment
include:

• The performance of a certain musical piece requires the
perceived latency to be aligned to the given tempo of the work.

• In a three-node setup, where one peer location is far more
remote than the other, the unadjusted latencies differ
significantly, so it might be desired to "harmonize" the perceived
latencies by artificially increasing the "distance" of the closer
peer.
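
As a worked example of the first case, the following sketch computes the delay to add so that the total round-trip time equals one beat. The figures (120 bpm, a measured round-trip time of 85 ms) and the function names are invented for illustration, and the sketch assumes that the artificial delay is inserted once into the round trip.

```python
def beat_ms(bpm: float, beats: float = 1.0) -> float:
    """Duration of the given number of beats in milliseconds."""
    return 60000.0 * beats / bpm


def extra_delay_ms(measured_rtt_ms: float, target_rtt_ms: float) -> float:
    """Artificial delay needed to reach the target round-trip time."""
    if target_rtt_ms < measured_rtt_ms:
        raise ValueError("latency can only be increased, not reduced")
    return target_rtt_ms - measured_rtt_ms


# One beat at 120 bpm lasts 500 ms; with 85 ms measured, 415 ms are added.
delay = extra_delay_ms(85.0, beat_ms(120))
```

The same subtraction covers the second case: the closer peer is delayed by the difference between the two unadjusted round-trip times.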

5.9. Considerations 

Certain aspects of writing software in Pd are difficult. Designing a
graphical user interface is relatively hard, and the graphical
representation is bound to pixel sizes and cannot be scaled
dynamically (i.e. by resizing the window). Also, it is not possible
to create dynamic interfaces that display different content depending
on context. Due to those limitations, it was decided to restrict some
capabilities of the client in order to provide a simple and
consistent interface. The number of channels per audio transmission
is limited to 8. Also, the maximum number of displayed peers and thus
the number of concurrent audio transmissions is limited to 8. This
limits the overall number of connected clients able to interact with
each other to 9. Those limitations are not imposed by the tpf-server
or the protocol, and the client could be adapted if need be. They are
arbitrary choices, and during the past year of using the tpf-tools,
those limits have never been reached in real life.

Unlike the original JackTrip implementation, each party in a
setup using the tpf-tools can individually choose the number of
channels to be sent. This saves bandwidth and might improve quality.
Also, the configured blocksize does not depend on the blocksize
mandated by the JACK server. This can be an advantage, since the
optimal JACK configuration might differ between clients.

6. EXPERIENCES AND DISCUSSIONS 

We were interested to know how the usage of the tpf tools impacts
audio quality and overall latency. We performed tests to compare
the usage of the UDP proxy with a traditional JackTrip client-server
connection. We wanted to know whether the usage of the UDP proxy
has an influence on the number of dropped packets. In another test,
we examined the latency differences between using a UDP proxy
and a direct JackTrip connection. We also examined whether the
tpf-client imposes a penalty on the quality of the audio transmission
compared to the original JackTrip.

6.1. Dropped packets imposed by UDP proxy 

For counting glitches (which are a result of dropped packets), we sent
a 1 kHz sine tone through JackTrip to a remote JackTrip instance
that looped back the signal, and recorded the result for a
predetermined period of time. We used the -z command-line option of
JackTrip, so that glitches were easier to spot visually in the
waveform. Then we counted the glitches by loading the recorded sound
file into a sound editor and examining the discontinuities in the
waveform. We were not able to determine a significant difference
between a direct link and a link using the UDP proxy. On another
occasion, unrelated to the test series, we experienced many
dropped packets. We later found out that the reason was a bug in the
driver of the virtual network interface of the virtual machine the UDP
proxy is running on. While the UDP proxy usually does not impact
the number of dropped packets negatively, there is a plethora of
possible reasons why the UDP proxy might behave badly, because it
depends on the hardware, on the operating system and on the software
itself. These sources of error do not apply to a direct JackTrip link.

6.2. Latency imposed by UDP proxy 

At the time of comparing the latency of a direct link to that of a
link using the UDP proxy, the tpf-client had not been written. So a
simple tool was built in Pd to send a single UDP packet to a remote
location, which sends the packet back immediately. The tool measures
the delay between sending and receiving the packet. The average travel
times turned out to be identical for both a direct link and a link
using the UDP proxy. This behavior was consistent across different
remote locations. It can be explained by the fact that both the
computer taking the samples and the server running the UDP proxy were
located on the same campus. By using tools like mtr or traceroute, we
found out that the number of hops between the computer taking samples
and the remote computer was the same for both link types. In a
scenario where both endpoints are located outside the campus hosting
the UDP proxy, using the UDP proxy adds latency. The amount depends on
how far the UDP proxy is away from the direct network path between
both endpoints.

6.3. Performance of the tpf-client 

We also tried to examine the impact of using the tpf-client compared
to the original JackTrip. It turns out that Pure Data adds one block
of additional latency, because the way it communicates with JACK
decouples its audio processing from the strict graph of the JACK
server. Many other JACK clients like JackTrip are tightly coupled
and do not add additional latency. When using a blocksize of 128 at
a samplerate of 48 kHz, the penalty of using the tpf-client is
approx. 2.67 ms. It increases with larger blocksizes or lower
samplerates.
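
The penalty is the duration of one block. As a worked example (the second and third lines use values chosen for illustration and are not measurements from our tests):

```latex
\begin{align*}
t_{\mathrm{block}} &= \frac{\mathit{blocksize}}{\mathit{samplerate}}
  = \frac{128}{48000\,\mathrm{Hz}} \approx 2.67\,\mathrm{ms} \\
\frac{256}{48000\,\mathrm{Hz}} &\approx 5.33\,\mathrm{ms}
  \quad \text{(larger blocksize)} \\
\frac{128}{44100\,\mathrm{Hz}} &\approx 2.90\,\mathrm{ms}
  \quad \text{(lower samplerate)}
\end{align*}
```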

Because Pd interfaces with the JACK server differently, it is possible
that Pure Data’s audio processing experiences audio drop-outs while
the JACK server does not. This means that the tpf-client introduces a
new source of possible buffer underruns. In our experience, this
theoretical penalty has not manifested itself in more glitches when
using the tpf-client, at least not when running the tpf-client on a
macOS system. On Linux, Pure Data was found, in some situations, to be
the source of glitches when not running with realtime privileges. It
was usually simple to remedy the situation.

6.4. Shortcomings of the JackTrip protocol 

While measuring the number of glitches with different combinations
of blocksize and number of channels, we found there was a sudden
increase in glitch rate when the number of channels exceeded a
certain value. When running two parallel transmissions, each carrying
only half the channels, we experienced a low rate of glitches. By
running other tests with the tool iperf, which allowed us to set the
rate and size of UDP packets, we found that link capacity was only
one limiting factor. No less important was the so-called Path MTU
10. UDP packets larger than the Path MTU are fragmented during
transport. The loss of a single fragment results in the loss of the
whole UDP packet. The likelihood of a UDP packet being dropped
increases with the amount of fragmentation it experiences. For best
performance, the UDP packet size should not exceed the Path MTU.

10 The Maximum Transmission Unit (MTU) is the maximum packet size that
is transmitted in a single network layer transaction, while Path MTU
refers to the maximum packet size that is transmitted through all
intermediate hops without fragmentation.


By running tests with iperf, we were not able to saturate a network
link with a UDP stream when choosing a relatively large packet size
(e.g. 16000 bytes). By selecting a smaller packet size (e.g. 1400
bytes), we were able to achieve a data transfer rate close to the
theoretical maximum while still keeping the number of dropped packets
low. This finding shows that the JackTrip protocol is not suitable for
all kinds of payloads, since the UDP packet size depends on bit
resolution, number of channels and blocksize:


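$$ \mathit{packetsize} = H_{\mathrm{UDP}} + H_{\mathrm{jacktrip}} + N_{\mathrm{channel}} \times \frac{b_{\mathrm{res}}}{8} \times b_{\mathrm{buffer}} \qquad (2) $$

where $H_{\mathrm{UDP}}$ = header size of UDP datagram,
$H_{\mathrm{jacktrip}}$ = header size of JackTrip frame,
$N_{\mathrm{channel}}$ = number of channels,
$b_{\mathrm{res}}$ = bit resolution,
$b_{\mathrm{buffer}}$ = buffer size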

Larger numbers of channels or larger blocksizes result in UDP packet
sizes bigger than the optimal size. With a typical Path MTU of 1500
bytes and a given blocksize of 128, the largest number of channels
still fitting into the Path MTU is 5 (1296 bytes). A single audio
transmission with a high number of channels could be split into two or
more parallel transmissions with a lower number of channels in order
to reduce the resulting packet size. However, synchronization between
the transmissions is not guaranteed and therefore this is not a
suitable solution. The ability to detect the Path MTU and to optimize
the UDP packet size by splitting a transmission into many, while
keeping synchronicity, are features that still need to be researched.
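
The figures above can be reproduced with a short Python sketch. The combined header size of 16 bytes is an assumption: it is inferred from the 1296-byte figure in the text (1296 minus 5 x 2 x 128 bytes of payload) and stands for the sum of both header terms; the function names are hypothetical.

```python
def packet_size(channels: int, blocksize: int,
                bit_resolution: int = 16, header_bytes: int = 16) -> int:
    """UDP packet size in bytes for one block of audio.

    header_bytes is the assumed sum of the UDP and JackTrip headers.
    """
    return header_bytes + channels * (bit_resolution // 8) * blocksize


def max_channels(path_mtu: int, blocksize: int,
                 bit_resolution: int = 16, header_bytes: int = 16) -> int:
    """Largest number of channels whose packet still fits the Path MTU."""
    bytes_per_channel = (bit_resolution // 8) * blocksize
    return (path_mtu - header_bytes) // bytes_per_channel
```

With a blocksize of 128 and 16 bit resolution, every channel adds 256 bytes, so five channels give 1296 bytes while six channels give 1552 bytes and exceed a Path MTU of 1500 bytes. Halving the blocksize to 64 doubles the number of channels that fit, at the cost of twice as many packets per second.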

6.5. UDP hole punching 

While there is no or only a negligible penalty for using the UDP
proxy when it is located close to one participating party, it might
add significant, possibly unacceptable latency when the participating
parties are all located geographically distant from it. In terms of
network latency, using a direct link is sometimes as good as, and in
many cases clearly superior to, using a proxy. A technique called UDP
hole punching allows us to establish a direct UDP connection between
two endpoints, both acting as clients, that is able to traverse
many types of NAT-firewalls. NAT-firewalls usually let an incoming
UDP packet pass when its receiver address (IP and port) matches the
sender address of a previously outgoing UDP packet. Because UDP is a
stateless protocol and has no notion of a connection, this is how
NAT-firewalls discern outbound connections (which are usually
allowed) from inbound connections (which are usually blocked).
Before establishing the connection, both endpoints contact a central
server to learn about their peer’s public IP address and port number.
Then they start sending packets to the address they learned. Because
this happens on both sides, the firewall on either side "thinks" the
connection was initiated from a local client and it will pass incoming
packets. The technique is already used in WebRTC and IP telephony
applications. The tpf-client supports UDP hole punching as an
experimental feature. Double-clicking (instead of single-clicking)
the left button in the peer row requests an audio transmission using a
direct link. There are still many scenarios where establishing
such a link fails. Supporting more cases and making UDP hole
punching a viable option is certainly a field worthy of further
exploration.
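
The principle can be illustrated with a toy model in Python. It does not open any sockets and is not the tpf-client implementation: the class `NatFirewall`, the function `punch` and the addresses are invented for illustration, and the model ignores everything that makes hole punching fail in practice, such as firewalls that assign a different public port for each destination.

```python
from typing import Set, Tuple

Address = Tuple[str, int]  # public (IP address, port)


class NatFirewall:
    """Toy NAT firewall: an inbound packet passes only if the local side
    has previously sent a packet to the address it comes from."""

    def __init__(self) -> None:
        self.contacted: Set[Address] = set()

    def send_to(self, remote: Address) -> None:
        # Outbound traffic is allowed and remembered.
        self.contacted.add(remote)

    def lets_in(self, remote: Address) -> bool:
        # Inbound traffic passes only as a "reply" to outbound traffic.
        return remote in self.contacted


def punch(nat_a: NatFirewall, addr_a: Address,
          nat_b: NatFirewall, addr_b: Address) -> bool:
    """Both endpoints send to the public address learned from the server."""
    nat_a.send_to(addr_b)
    nat_b.send_to(addr_a)
    return nat_a.lets_in(addr_b) and nat_b.lets_in(addr_a)
```

Before `punch` is called, neither firewall lets packets from the other endpoint in; afterwards both do, because each side has sent a packet first. Packets from any third address remain blocked.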
ABSTRACT

Given the relative stagnation in single-thread performance of many
processors in recent years, made even worse by recent security
findings such as SPECTRE or L1TF which led to restrictions in
existing features and decreased performance for the sake of security,
it is necessary to find new ways to improve the run-time performance
of dynamic multimedia systems. In this paper, we present the
introduction of a just-in-time compiler in the ossia score interactive
score authoring and playback software. We discuss in particular the
creation of a toolchain and software development kit for C++
just-in-time compilation on the three major desktop platforms, the
challenges and benefits caused by the use of C++ in terms of standard
library requirements, but also the benefits that the system offers in
terms of live-coding.

Keywords: interactive scores, just-in-time compilation, toolchains 

1. INTRODUCTION 

Users of multimedia software demand two features which can be hard
to reconcile. On one hand, they ask for more performance, the ability
to run more tracks, add more effects, etc. On the other hand, they
request more dynamic behavior and easily extensible systems - in
particular, systems which do not require the user to write Makefiles
and set up a compilation toolchain. But such dynamic behavior
generally comes at a cost: for instance, Javascript, Lua or Python are
often integrated with media environments, such as Blender, ossia
score, and Renoise. These languages can have undesirable properties in
low-latency audio environments: they can cause spurious dynamic memory
allocations, which prevent real-time guarantees from being ensured.

Ongoing advances in just-in-time compilation can to some extent 
reconcile these needs. The LLVM project [7] provides simple APIs 
to integrate compiler and assembler in C++ software, through the 
MCJIT and OrcJIT sub-libraries. 

The benefits of just-in-time compilation have been known for
a long time [2]; of particular interest to us is the ability of
just-in-time compilers to adapt to the exact CPU type available in the
user’s computer. This can lead to great performance improvements:
modern compilers are able to generate correctly vectorized code for
vector instruction sets, such as SSE, AVX, AVX-2, AVX-512 on
x86-based platforms, or Neon on ARM platforms. But in the traditional
compilation model, the author of the software has to know beforehand
for which instruction sets the software shall provide optimized
routines, and either write them manually in assembler or with
intrinsics, use compiler-specific extensions such as GCC’s function
multiversioning 1, or resort to manual run-time dispatch to the
correct function according to detection of the user’s CPU. This leads
to an increase in executable size for all the users of the software,
and can be quite time-consuming for the developer. Thus, we propose to
leverage JIT compilation for some of the most performance-critical
parts of media software so that they can be compiled in the most
optimal way for the user’s CPU.

1 https://lwn.net/Articles/691932/

The proposed system simply compiles C++ code. This is in contrast
with many approaches such as Faust [11] for audio signal processing,
PostgreSQL [13] for improvement of the SQL query performance,
or the language created by Avramoussis et al. for transformation of
geometry assets in the VDB format [1]. These systems all provide
custom domain-specific languages (DSL) to solve a well-defined task.
This has the advantage of freeing oneself from C and C++’s
complicated legacy and generally simplifying the language semantics,
but also means that:

• A large amount of work must be provided by the new language 
authors. 

• The language won’t necessarily be subject to new advances
in compiler development unless its authors keep working on
it: while some optimization phases can occur at a later stage
if leveraging an existing compiler framework such as LLVM,
some optimizations require actual knowledge of the language’s
semantics and thus cannot be applied generically to any DSL.

• The language may not be able to leverage the existing corpus 
of libraries available in C and C++. 

The system is integrated in the ossia score software [6, 4] for
media creation. Part of the motivation is to improve run-time
performance while live-coding: the software currently features a
Javascript engine which can be leveraged to provide new behaviors at
run-time. While it is one of the favorite features of the software’s
user base, it comes at a cost: no real-time safety due to the
Javascript engine performing many memory allocations, and huge
“context switch” costs between the native code world and the
interpreted Javascript engine world. The objective is to improve the
run-time performance while retaining some of the properties provided
by live-coding: for this, Thor Magnusson gives the hard criterion that
a live-coding language should not take more than five seconds between
code and sound [10].

We will first give a brief overview of the OSSIA project, and
of the way just-in-time compilation is introduced into the system.
Then, we will give some pointers towards the creation of a
cross-platform toolchain which supports JIT compilation on the
three major desktop operating systems, Linux, macOS and Windows.
Some performance metrics will be discussed.

2. OSSIA PROJECT 

ossia 2 is an open-source software suite composed of a library
(libossia) and a graphical user interface (ossia score) for managing
communication, mapping and time-scripting between various software
in interactive multimedia artworks. This toolset is cross-platform
(Windows, macOS, Linux) and cross-protocol (OSC, MIDI, ...). The
libossia library has been ported to many creative coding tools
(Ableton, Max/MSP, PureData, VVVV, Touch Designer, OpenFrameworks,
Processing...). It simplifies connecting and controlling various
digital production software together. Its main goals are to facilitate
the development of time-centric interactive artworks and lower the
barrier of entry to interactive media creation and authoring for
emerging artists.

2 https://ossia.io/

The ossia score software’s execution engine is based on a dataflow 
architecture described in [3]. The user interface part leverages a 
modern C++ and Qt-based generic document framework which can 
be easily reused for other document-centric software. It features an 
extensible plug-in API, undo-redo with automatic recovery in case 
of crash, interface injection, serialization, selection handling and 
multiple document management. It is specifically well-tailored to 
hierarchical document structures and enforces strong typing practices. 

This framework has been used in an unrelated piece of software as a
test of its flexibility: a point-and-click game editor (SEGMent,
developed with Raphael Marczak 3).



Figure 1: ossia score, the main software leveraging this framework 


3. C++ JIT

We chose to extend ossia score with a C++ just-in-time compilation 
mechanism. The main motivations for this were: 

• Using C++ makes it easy to reuse large amounts of existing
code; for instance digital signal processing libraries such as
Gamma [12], KFR 4 or FFmpeg 5.

• Due to the amount of software built using C++, compiler
optimisations for this language are still an active research topic
[8, 9], which guarantees “free” performance improvements in the
following years.

• ossia score was already integrating Faust, which itself uses 
LLVM, and thus acted as a gateway drug of sorts. 

4. PLUG-IN AND PLUG-IN APIS 

ossia score already provides multiple plug-in APIs: a simple API
based on defining a unit generator with strong type-safety features
relating to the input and output ports of the unit generator, and a
low-level API which allows creating plug-ins that can modify every
part of the ossia score software: menus, panels, etc.

3 https://scrime.u-bordeaux.fr/Arts-Sciences/Projets/Projets/SEGMent2-Study-and-Education-Game-Maker

4 https://www.kfrlib.com

5 https://www.ffmpeg.org

The JIT system leverages the existing plug-in APIs: the same
code can seamlessly be integrated either during the build of ossia
score, or at run-time. We give below a brief overview of these two
APIs.


4.1. Safe process API 

This API only gives the ability to provide a new unit generator to
the system. Inputs, outputs and controls are given as C++ constant
expressions, which generate the user-interface code at compile-time
and guarantee type-safety. The necessary boilerplate being relatively
small (for C++ code), it is viable to use in live-coding contexts. A
specific unit generator, for now simply named “C++ Jit process” in
the software, allows the user to input code using this API, which will
be live-recompiled; the corresponding node will then be instantiated.

Algorithm 1 provides an example of a “gain” node, which has 
one audio and one floating-point input, one audio output, and applies 
the gain to the input. 


Algorithm 1: A naive gain implementation in the “safe” plug-in API. The
inputs and outputs of the unit generator are declared in the Metadata
struct. A compile-time mechanism ensures that the prototype of the run
function conforms to this declaration, and that the types of the
arguments are correct. This increases type safety at run-time when
compared to the more traditional C-based solutions, where the programmer
has to manually cast the inputs of the unit generator into the correct
type according to knowledge that is not part of the type system.


struct Node
{
  struct Metadata : Control::Meta_base
  {
    static const constexpr auto prettyName = "Gain";
    static const constexpr auto controls
        = std::make_tuple(Control::FloatSlider{"Gain", 0., 2., 1.});
    static const constexpr audio_in audio_ins[]{"in"};
    static const constexpr audio_out audio_outs[]{"out"};
  };

  using control_policy = ossia::safe_nodes::last_tick;

  // p1 is the audio input, g the value of the "Gain" slider,
  // p2 the audio output.
  static void run(
      const ossia::audio_port& p1, float g, ossia::audio_port& p2,
      ossia::token_request, ossia::exec_state_facade)
  {
    const double gain = (double)g;
    const auto chans = p1.samples.size();
    p2.samples.resize(chans);

    // Each channel is processed independently.
    for (std::size_t i = 0; i < chans; i++)
    {
      auto& in = p1.samples[i];
      auto& out = p2.samples[i];

      const auto samples = in.size();
      out.resize(samples);

      for (std::size_t j = 0; j < samples; j++)
      {
        out[j] = in[j] * gain;
      }
    }
  }
};
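
The compile-time check described in the caption of Algorithm 1 can be
illustrated with a small, self-contained sketch. It is not the mechanism
used by ossia score: the FloatSlider descriptor, the run_matches_controls
trait and the simplified run signature (controls only, returning a float)
are all assumptions made for this illustration.

```cpp
#include <tuple>
#include <type_traits>

// Hypothetical control descriptor: value_type is the C++ type that the
// corresponding argument of run() must have.
struct FloatSlider
{
  using value_type = float;
  const char* name;
  float min, max, init;
};

// Primary template, specialized below for a tuple of control descriptors.
template <typename Node, typename Controls>
struct run_matches_controls;

// True only if Node::run has exactly the signature
// float(value_type of control 1, value_type of control 2, ...).
template <typename Node, typename... Controls>
struct run_matches_controls<Node, std::tuple<Controls...>>
    : std::is_same<decltype(&Node::run),
                   float (*)(typename Controls::value_type...)>
{
};

template <typename Node>
constexpr bool run_is_valid = run_matches_controls<
    Node, std::remove_cv_t<decltype(Node::controls)>>::value;

struct GainNode
{
  static constexpr auto controls
      = std::make_tuple(FloatSlider{"Gain", 0.f, 2.f, 1.f});
  static float run(float gain) { return 0.5f * gain; }
};

struct MismatchedNode
{
  static constexpr auto controls
      = std::make_tuple(FloatSlider{"Gain", 0.f, 2.f, 1.f});
  // Takes an int although the declared control is a float slider.
  static float run(int gain) { return static_cast<float>(gain); }
};

// Both checks are evaluated by the compiler; nothing runs at run-time.
static_assert(run_is_valid<GainNode>, "run() must match the controls");
static_assert(!run_is_valid<MismatchedNode>, "mismatch must be detected");
```

In a real plug-in the negative case would not be asserted: a single
static_assert on the node type would turn the mismatch into a compile
error, which in a live-coding context is reported before the node is
ever instantiated.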

4.2. General plug-in API

This API enables its user to introduce new elements in most parts of 
the software: 

• New menus, panels, etc. 

• Run-time additions to existing data types of the software. 

• File loaders. 

• Network and hardware protocols. 

At the source code level, it mainly leverages the Abstract Factory
design pattern. A plug-in can define a new interface, identified by a
UUID. An example is given in Algorithm 2:


Algorithm 2: An example of interface definition in ossia score. This
particular interface allows a plug-in to register the handling of new 
file types in the “Library” panel. 


class LibraryInterface : public score::InterfaceBase
{
  SCORE_INTERFACE(LibraryInterface, "9b94d974-9f2d-4986-a62b-b69e51a4d305")

public:
  ~LibraryInterface() override;

  virtual QSet<QString> acceptedFiles() const noexcept;
  virtual QSet<QString> acceptedMimeTypes() const noexcept;

  virtual void setup(
      ProcessesItemModel& model
      , const score::GUIApplicationContext& ctx);
  virtual bool onDoubleClick(
      const QString& path
      , const score::DocumentContext& ctx);

  // ...
};


Plug-ins can then register implementations for these interfaces, 
which can be listed and accessed through a global context object. 
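
As an illustration of this pattern (interfaces identified by a UUID,
implementations registered by plug-ins and listed through a context
object), here is a minimal sketch. None of these names come from ossia
score: InterfaceBase, FileHandler, WavHandler, Registry and the UUID
string are hypothetical, and the real API relies on the SCORE_INTERFACE
macro shown in Algorithm 2.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: every implementation reports the UUID of the
// interface it implements.
struct InterfaceBase
{
  virtual ~InterfaceBase() = default;
  virtual std::string interface_uuid() const = 0;
};

// Hypothetical interface, identified by a made-up UUID.
struct FileHandler : InterfaceBase
{
  static constexpr const char* uuid = "00000000-0000-0000-0000-000000000001";
  std::string interface_uuid() const override { return uuid; }
  virtual bool accepts(const std::string& extension) const = 0;
};

// One implementation, as a plug-in could provide it.
struct WavHandler : FileHandler
{
  bool accepts(const std::string& extension) const override
  {
    return extension == "wav";
  }
};

// Stand-in for the global context: implementations grouped by UUID.
class Registry
{
public:
  void add(std::unique_ptr<InterfaceBase> impl)
  {
    const std::string key = impl->interface_uuid();
    impls_[key].push_back(std::move(impl));
  }

  // Lists every registered implementation of the given interface.
  template <typename Interface>
  std::vector<Interface*> all() const
  {
    std::vector<Interface*> out;
    const auto it = impls_.find(Interface::uuid);
    if (it == impls_.end())
      return out;
    for (const auto& impl : it->second)
      out.push_back(static_cast<Interface*>(impl.get()));
    return out;
  }

private:
  std::map<std::string, std::vector<std::unique_ptr<InterfaceBase>>> impls_;
};
```

A plug-in would call add() once per implementation when it is loaded, and
client code would iterate over all<FileHandler>() to find a handler that
accepts a given file. The static_cast is safe here only because
implementations are stored under the UUID of the interface they derive
from.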

The majority of the ossia score codebase is based on this API, the 
actual software being itself merely a set of plug-ins implemented on 
top of the base plug-in framework. The JIT extension discussed here 
is itself a plug-in 6 . 

The original plan for ossia score was to rely on this plug-in API to
allow prebuilt extensions to be downloaded from a common repository.
Due to the ongoing development of the software, no ABI (Application
Binary Interface) stability guarantees are provided, which means that
plug-ins must generally be recompiled against the source code of newer
versions. This requires an extensive build infrastructure which can
regularly rebuild and publish not only new versions of ossia score but
also the plug-ins. Common service providers such as Travis CI and
AppVeyor do not provide enough capacity for this to be viable for an
open-source, volunteer-led project.

Hence, the plan going forward is to distribute the plug-ins not included
in the base software in source code form. The JIT system looks for
add-ons on startup in the user library folder, for instance
~/Documents/ossia score library/Addons, and simply compiles all the
source files of each add-on together. This guarantees that API and ABI
breakage do not cause subtle run-time errors, since the add-ons are
compiled against the exact source code that was used to build the
software, the headers being shipped as part of the package: if the API
has changed in a breaking manner, the add-on will not be compiled at all
and the user is warned.

6 https://github.com/OSSIA/score-addon-jit


5. A CROSS-PLATFORM TOOLCHAIN 

ossia score being cross-platform software, it is necessary to ensure the
same level of support on the three major operating systems: Windows,
macOS and Linux. The endeavor was relatively straightforward on Linux
thanks to the availability of the LLVM libraries and compilers in package
managers. In particular, the Linux implementation of JIT compilation in
ossia score is also able to use system libraries instead of the ones
provided by the toolchain. The official release of ossia score is based
on the AppImage mechanism, which allows it to work on many distributions:
as such, it is also necessary to build a recent toolchain to be able to
target older systems, such as CentOS 7 or Ubuntu 12.04.

The complete toolchain, whose build scripts are available at
https://github.com/OSSIA/sdk, provides the following libraries:

LLVM 7.0.1 (8 svn on Windows due to previous versions not working),
Qt 5.12, FFmpeg 4.1, PortAudio, JACK headers, SDL2, OpenSSL, Faust.

5.1. Uniform C++ standard library 

The C++ parts of the toolchain are built against the libc++ standard
library implementation on all platforms. This is for two reasons:
uniformity and licensing. Using a single C++ standard library across all
platforms guarantees less variance in behavior; such variance is still
fairly common across implementations, for instance in standard
algorithms or in complex libraries such as <regex>. On Windows in
particular, the standard library headers provided as part of Visual
Studio are not freely redistributable, so relying on them would introduce
an unacceptable dependency on a Visual Studio installation into ossia
score. Hence, we use the system headers provided by the mingw-w64
project, along with the LLVM libc++ standard library. The build process
involves a first build of the LLVM project, clang compiler and libc++
standard library, which are then used to bootstrap a second set of LLVM
libraries. This is needed because the JIT implementation calls directly
into LLVM’s OrcJIT API: if we linked directly against the first set of
LLVM libraries, there would be a standard library mismatch which would in
the best case fail to link properly, and in the worst case fail at
run-time.

The llvm-mingw project 7 greatly simplified the creation of the 
Windows toolchain. 

5.2. macOS and rpath handling 

macOS is special in that libc++ is the default C++ library
implementation. There is no equivalent to MinGW in the Apple world: the
only implementation of system headers is the one provided by Apple. Those
are not under a free license, with the exception of the C standard
library and Mach kernel headers.

In addition, the customized clang / libc++ provided by Apple is slightly
out-of-date when compared to other platforms’ implementations and suffers
from some artificial limitations: using various C++17 standard library
types, such as std::any, std::optional or std::variant, restricts
deployment to the most recent version of macOS, 10.14. This is not
acceptable for multimedia software, whose users are often restricted to
older system versions for the sake of compatibility. The macOS version of
the toolchain thus provides its own clang / libc++ build which overcomes
this problem.


7 https://github.com/mstorsjo/llvm-mingw 

A custom-built clang-based toolchain on macOS will by default still link
against the system libc++ implementation. The observed behavior is as
follows:

• No arguments passed: the compiler hard-codes an absolute path to the
system /usr/lib/libc++.1.dylib.

• -L$SDK/lib -lc++ -lc++abi: the compiler links the software to
@rpath/libc++.1.dylib.

It is thus necessary to specify the rpath to get working binaries during
development: -L$SDK/lib -lc++ -lc++abi -Wl,-rpath,/sdk/lib.


6. BENCHMARKING 

We provide a few performance tests of the system, to show what advantages
and what costs C++ JIT compilation actually brings. Benchmarks are run on
two machines, both running Linux (Kernel: 4.20.8-arch1-1-ARCH):

• Machine 1: Intel(R) Core(TM) i7-6900K CPU @ 4.00GHz 
(Broadwell architecture, desktop). 

• Machine 2: Intel(R) Core(TM) i7-8750H CPU @ 4.00GHz 
(Coffee Lake architecture, laptop). 


In each case, we measure the time taken by the computation for various
common buffer sizes. Figure 2 gives the measurements for the first
machine, Figure 3 for the second machine.





6.1. Compile times 

C++ is notorious for its slow compile times, due to the large amounts of
header files to include and the cost of the template instantiation
mechanism. More recent C++ standards being oriented towards compile-time
computation of most values in a program also leads to an increase in
compile times.

On the test machine, a simple node such as the one provided in
Algorithm 1 takes between 1.3 and 1.5 seconds to compile on an average of five
runs. A generic test addon providing mock implementations of a few 
interfaces, comprised of 7 source files, 10 header files, for a total of 
428 lines of code which themselves include part of the C++ standard 
library and Boost, takes between 4.5 and 5 seconds to compile on an 
average of five runs. 

LLVM generates bitcode, which could be cached on-disk, and 
be used to make following start-ups faster. This optimization is not 
yet applied and a complete recompile cycle currently occurs for each 
addon on startup. 

The current “interactive” performance characteristics, while much 
slower than what the JavaScript interpreter provides, are thus still
viable for some level of live-coding. 

6.2. Run times: benchmarking gain adjustment 

We discuss here the runtime improvements provided by the system. The
following cases are tested:

• The gain node of Algorithm 1 as provided pre-built in the ossia score
binary, which must work on a variety of systems and thus is not
optimized for any kind of vector instruction set outside of the x86-64
SSE2 baseline.

• The same gain node, passed through the system presented in this paper,
which operates at an -Ofast -march=native optimization level and is
thus able to take into account the user’s actual CPU features.

• A manually optimized version of the gain node, written with AVX
intrinsics.
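
The benchmark code itself is not reproduced in the paper. As a rough
sketch of how the average time per buffer could be measured for the plain
loop, assuming std::chrono::steady_clock as the timer (the function names
apply_gain and average_gain_time_ns are hypothetical):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Scalar gain loop, with the same shape as the inner loop of Algorithm 1.
void apply_gain(const std::vector<double>& in, std::vector<double>& out,
                double gain)
{
  out.resize(in.size());
  for (std::size_t j = 0; j < in.size(); j++)
    out[j] = in[j] * gain;
}

// Average wall-clock time, in nanoseconds, of one call to apply_gain
// for the given buffer size. Returns 0 for degenerate arguments.
double average_gain_time_ns(std::size_t buffer_size, int iterations)
{
  if (buffer_size == 0 || iterations <= 0)
    return 0.0;

  std::vector<double> in(buffer_size, 0.5), out;
  double checksum = 0.0;

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; i++)
  {
    // The gain varies slightly so that calls cannot be merged.
    apply_gain(in, out, 1.0 + i * 1e-9);
    checksum += out[buffer_size / 2];
  }
  const auto stop = std::chrono::steady_clock::now();

  // Writing the checksum to a volatile keeps the compiler from removing
  // the measured work as dead code.
  volatile double sink = checksum;
  (void)sink;

  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

Building this file once with baseline settings and once with
-Ofast -march=native, then calling average_gain_time_ns for a few buffer
sizes, is one way to compare the first two cases listed above. The
hand-written AVX variant is not sketched here.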


Figure 2: Broadwell CPU: average time in nanoseconds to compute 
a buffer. In blue: generic code with the default compilation settings. 
In orange: generic code while built with the JIT system. In green: 
manually-written AVX implementation. 

Figure 3: Results for the Coffee Lake CPU, following the same
nomenclature as for the Broadwell CPU.


Figure 4 presents the improvements between the two CPUs, in 
order to help the reader see the differences more clearly between 
figures 2 and 3. 

Figure 4: Performance difference between the Coffee Lake and the
Broadwell CPU: it is interesting to note that the buffer size heavily
influences which workloads benefit the most from the CPU improvements.


6.3. Run times: benchmarking FFT 

For this benchmark, we compare the run time of a Fast Fourier Transform
algorithm implemented in the KFR library mentioned earlier. This library
provides hand-optimized versions for many different instruction sets,
ranging from SSE2 to AVX2. The results are presented in Table 1. The test
is done on a large array: 16384 double-precision floating-point values.

Machine       Generic   JIT      Time saved
Broadwell     214 µs    144 µs   32.7%
Coffee Lake   172 µs    107 µs   37.8%


Table 1: Performance increases yielded by using the proper instruction 
set. 


6.4. Discussion 

A few things are made apparent by the previous benchmarks: 

• In simple cases, it is pointless to try to optimize better than 
what the compiler can: the manually-written AVX version 
is almost never faster than the simple for-loop version when 
optimized by the compiler. 

• The improvement in that case is fairly expected: AVX is able to
compute almost twice as many floats as SSE2 in the same time.

• In the more complex, hand-optimized case of the FFT, there 
are also important performance benefits. 

• The C++ compile-times are certainly not negligible for large 
amounts of code. Potential paths for improvement could be the 
use of precompiled headers, or upcoming C++ modules. 


In addition, we note that the system does not currently add any
performance benefits, nor drawbacks, versus compiling the whole codebase
at -Ofast -march=native. Thus, the system is mainly useful
performance-wise in the case where end-users are not able to rebuild the
software themselves. While on Linux systems this is generally not a
problem (even though users may use old distributions with compilers
unable to support recent editions of the C++ language required by ossia
score), this is tremendously useful for Mac and Windows users, where the
default toolchain requires multiple gigabytes of disk space and takes a
long time to install.

7. CONCLUSION 

We presented the integration of a C++ just-in-time compilation system
based on LLVM in an existing media authoring environment, ossia score.

There are multiple further steps that we would like to reach for 
the system: 

• Correct live-reloading of addons. The main problem to handle is that a
JIT-compiled addon may instantiate new objects in the system. These
objects must be tracked, serialized and reloaded whenever the addon
code changes; otherwise, since the ABI of the objects may change, the
reload will cause runtime crashes.

• Generation of cross-compiled code. An often requested feature 
for ossia score is to support embedded architectures. While 
the software already builds and runs on such systems, it would
be useful to generate a minimal executable for such platforms 
from a desktop machine, which only contains a given score 
with implementations optimized for the exact system being 
targeted. 

• On longer time-scales, cross-unit-generator optimizations could be
interesting: in particular, how can the system integrate with other
languages also based on LLVM, such as Faust? The Mozilla team is
currently researching cross-language inlining between C++ and Rust, for
instance. Combining multiple audio nodes written in different
languages and compiling them together in a single dataflow graph may
open further optimization opportunities.

Finally, the JIT denomination for the system is debatable: ossia score
is itself an interpreter for a visual language, but programs written in
this visual language are executed only once every part of the system has
been compiled to assembly. For reasons of safety, we prefer not to launch
C++ compilations during the execution of a score, since doing so may
seriously hamper the available performance of the system. The JIT process
still allows this, but the user must be aware of the risks, for instance
if the score already uses most of the machine’s cores.

ABSTRACT

The Stage, a small concert hall at CCRMA, Stanford University, was
designed as a multi-purpose space when The Knoll, the building that
houses CCRMA, was renovated in 2003/5. It is used for concerts,
installations, classes and lectures, and as such it needs to be always
available and accessible. Its support for sound diffusion evolved from an
original array of 8 speakers in 2005, to 16 speakers in a 3D
configuration in 2011, with several changes in speaker placement over the
years that optimized the ability to diffuse pieces in full 3D surround.
This paper describes the evolution of the design and a significant
upgrade in 2017 that made it capable of rendering HOA (High Order
Ambisonics) of up to 5th or 6th order, without changing the ease of
operation of the existing design for classes and lectures, and making it
easy for composers and concert presenters to work with both the HOA and
legacy 16 channel systems.

1. INTRODUCTION 

We have been hosting concerts at CCRMA since it was created in the 70’s.
In 2009 we started expanding our concert diffusion capabilities while
gearing up for the inaugural season of a new concert hall being built at
Stanford, the Bing Concert Hall. In 2013 we were able to use our newly
created GRAIL system (the Giant Radial Array for Immersive Listening) to
diffuse concerts with our own “portable” speaker array with up to 24
speakers and 8 subwoofers arranged in a dome configuration for full 3D
surround sound diffusion [1].



Figure 1: CCRMA Concert in the Bing Studio with the GRAIL 


By 2011 our Listening Room Studio included a 22.4 speaker array in a full
3D configuration (with speakers below an acoustically transparent grid
floor), which could accurately decode periphonic (full 3D) 3rd order
Ambisonics. Our upgraded GRAIL concert diffusion system was also able to
render up to 3rd order Ambisonics, or even 4th order if some errors in
rendering were ignored. This was made possible by the publication of
algorithms that allowed the design of HOA decoders for irregular arrays
[2]. In particular, the release of the Ambisonics Decoder Toolkit software
package written by Aaron Heller [3] [4], which included software
implementations of the aforementioned research, simplified the task of
designing decoders. This work enabled the creation of successful
diffusion strategies for irregular speaker placement in the Bing Concert
Hall and its rehearsal space (the Studio), as well as other spaces. Both
systems benefited from an open architecture based on the GNU/Linux
operating system and many free audio software packages that, combined,
allowed us to tailor the system to our specific needs.

We have curated many concerts with content of varied spatial resolution.
As composers went on to create works requiring more speakers for a higher
Ambisonic order decode, the limitations of our systems became apparent.
While Ambisonics is well known for a graceful degradation of the spatial
resolution when not enough speakers are available for the original order
of the piece, the state of research and artistic creation was moving
towards orders that were higher than what we could support.

1.1. From WFS tests to HOA in the Stage

In 2011 we bought 32 small speakers (Adam A3X) to create an experimental
WFS array. Over the next few years we used it for demos and classes, but
other than a couple of concert performances the system was used very
sparingly. On the other hand, our Stage concert hall had a complement of
16 speakers and 8 subwoofers, which limited our ability to render full 3D
HOA (we had been recently using a 32.8 system for our off-site concerts).

In an effort to upgrade our dedicated diffusion space at CCRMA, we
proposed to re-purpose the “unused” speakers and add them to the existing
Stage diffusion system. This addition would increase the total count of
speakers to 48, and preliminary studies determined that we would be able
to render up to 6th order Ambisonics quite accurately. Natasha Barrett’s
research [5] points to diminishing returns in spatial performance for 7th
and higher order decoding, so we felt confident that moving to a fifth or
sixth order system would be adequate for our needs and a worthwhile
upgrade.

The design and implementation of this upgrade ended up being anything but
easy.

2. REQUIREMENTS

The existing system in the Stage consisted of 8 movable tower stands, each
one housing a main speaker (four S3A and four P33 Adam high quality
mid-field studio monitors) and a subwoofer (M-Audio SBX10). In addition to
those, we had 8 Adam P22 speakers hanging from the trusses and arranged as
a ring of 6 with two more overhead. All 16 speakers could be individually
addressed from a Yamaha DM1000 mixer, with some limitations as the
subwoofers were paired to the 8 main speakers (we used their internal
crossovers) and could not be used by the upper 8 speakers.

The Stage is not only a concert hall, it is also regularly used for
classes, lectures, demos and other events that do not need or want a high
spatial resolution speaker array. In fact, the majority of users require
access to just stereo playback. As the CCRMA concert events combine live
performers, touring musicians and researchers, many concerts do not deal
with 3D surround sound and use mostly stereo projection. The existing
flexible 16 channel system allowed for creative diffusion using a
combination of speakers and provided flexibility in the orientation in
which the space could be used.

One of the key requirements for the upgrade was that the existing system
and methods of operation would not be changed. Furthermore, the space is
sometimes used to accommodate big audiences (for its size), so any
addition to the Stage could not permanently encroach on the floor space
available for setting up chairs for events.

These varied requirements complicated the design process in 
ways which we had not anticipated. 

We were required to: 

1. have a mode of operation that would keep the existing design, 8 main
speaker and subwoofer towers plus 8 secondary speakers hanging from the
ceiling trusses, all of them driven directly from our DM1000 digital
mixer

2. not degrade the performance of the existing system in any way,
including the low latency achievable with the digital mixer,
appropriate for live performances

3. have a way to easily switch from the basic system to a fully expanded
speaker array which added 32 speakers, all of them controlled through a
single Linux based computer similar to the one managing diffusion tasks
in our Listening Room [6] [7]

4. have the ability to physically move the additional small speakers
positioned at ear level out of the way, so that they would not
interfere with the existing floor footprint of the diffusion system

5. easily switch between the two modes of operation, preferably with “one
big switch” that would need no expertise from the operator

6. keep the system “low cost”

This created a situation with many mutually incompatible system
requirements from a design standpoint.

3. FEASIBILITY TESTING

Before starting the upgrade a practical question had to be answered: were
the tiny A3X speakers good enough (in quantity) to be able to produce
enough SPL for a concert diffusion situation? Matt Wright and Christopher
Jette organized a quick test session in which we installed 16 speakers in
a ring at ear level (on top of chairs and plastic bins!) and drove them
from our GRAIL concert control computer. This test was successful and
confirmed that they were up to the task, but only if properly equalized,
so we could go ahead with the upgrade.


4. LOCATION, LOCATION, LOCATION 

Deciding where and how to mount all the speakers was a difficult task,
made harder by the rectangular shape of the room and the presence of
trusses that hold the cathedral-style ceiling. To arrive at a preliminary
even distribution in space we used a simple successive approximation
program that treats speaker locations as electrons that repel each other,
and determines the approximate ideal locations of the speakers [8].
Additional constraints were introduced in the software to “fix” the
position of the existing 16 speakers in space (remember that our design
must be a superset of the existing system), and see where the rest of the
speakers would fall.
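
The idea of repelling points with some positions held fixed can be
sketched as follows. This is not the software cited in [8]: the
inverse-square force, the fixed step size, the clamping to the upper
hemisphere and all names (Vec3, project_to_dome, relax_on_dome) are
assumptions made for this illustration.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Projects a point back onto the upper unit hemisphere (z >= 0).
Vec3 project_to_dome(Vec3 p)
{
  if (p[2] < 0.0)
    p[2] = 0.0;
  const double n = std::sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
  if (n == 0.0)
    return {0.0, 0.0, 1.0};
  return {p[0] / n, p[1] / n, p[2] / n};
}

// Successive approximation: every point is pushed away from all others
// by an inverse-square force, then projected back onto the dome.
// Points with fixed[i] == true (the existing speakers) never move.
void relax_on_dome(std::vector<Vec3>& pts, const std::vector<bool>& fixed,
                   int iterations, double step)
{
  for (int it = 0; it < iterations; it++)
  {
    std::vector<Vec3> next = pts;
    for (std::size_t i = 0; i < pts.size(); i++)
    {
      if (fixed[i])
        continue;

      Vec3 force{0.0, 0.0, 0.0};
      for (std::size_t j = 0; j < pts.size(); j++)
      {
        if (i == j)
          continue;
        const Vec3 d{pts[i][0] - pts[j][0], pts[i][1] - pts[j][1],
                     pts[i][2] - pts[j][2]};
        // The small constant avoids a division by zero for coincident points.
        const double dist2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + 1e-9;
        const double inv = 1.0 / (dist2 * std::sqrt(dist2));
        for (int k = 0; k < 3; k++)
          force[k] += d[k] * inv;
      }

      for (int k = 0; k < 3; k++)
        next[i][k] = pts[i][k] + step * force[k];
      next[i] = project_to_dome(next[i]);
    }
    pts = next;
  }
}
```

With the existing speakers marked as fixed and the additional ones
started at arbitrary positions on the dome, repeated iterations spread
the free points into the gaps left by the fixed ones. The resulting
directions are only ideal targets; they still have to be matched to real
mounting points in the room.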



Figure 2: Ideal projection of speaker locations on a hemisphere (red 
dots: original upper 8 speakers, blue dots: ear level speakers) 

A simple geometrical model of the Stage created in OpenSCAD [9] was used
to project those ideal locations onto the walls and ceiling of the Stage,
to see where we might approximate the ideal locations in space with real
mounting points. It was challenging to find locations which would not be
shadowed by the ceiling trusses for most of the audience seating space,
and in a couple of instances there was unavoidable shadowing that we had
to ignore.



Figure 3: OpenSCAD model of the Stage (seen from below) with 
speaker location projections, the cylinders partially represent the 
A/C ducts, the black beams are the lower part of the trusses 

We used ADT (the Ambisonics Decoder Toolkit) [3] [4] as a design
verification tool. In particular, the energy and particle velocity graphs
helped us determine if the proposed mounting locations for the speakers
would provide uniform coverage for the desired Ambisonics orders (5th and
6th order was the goal). Other diffusion methods (VBAP, etc.) would also
benefit from a uniform spatial distribution of the speakers.
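
For reference, a full-sphere (periphonic) Ambisonic signal of order N
carries (N + 1)^2 components, which gives a feel for why the speaker
count matters for the target orders. The helper name below is chosen for
this example:

```cpp
// Number of components of a full-sphere (periphonic) Ambisonic signal
// of the given order: (order + 1)^2.
constexpr int ambisonic_components(int order)
{
  return (order + 1) * (order + 1);
}
```

Orders 3, 5 and 6 correspond to 16, 36 and 49 components respectively. A
common rule of thumb asks for at least as many speakers as components on
a full sphere; since the Stage array is a dome rather than a full sphere,
this count is only a rough guide and the ADT graphs remain the actual
verification.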

Figure 4: Side view of the Stage with speaker mounting points. Grey dots
are ideal positions in a hemispherical dome, colored rectangles are the
real positions

The final speaker configuration at which we arrived was an ear 
level ring of 20 speakers (the 8 original towers plus 12 additional 
A3X speakers), another ring of 14 A3X speakers mounted on the 
trusses (roughly 20 degrees in elevation above the first ring), and the 
original 8 speakers (roughly 20 degrees of elevation higher) plus 6 
more A3X’s distributed in the upper part of the dome. The 12 ear 
level speakers could not be mounted on stands that would take away 
floor space needed for seating, and had to be able to be moved out of 
the way when not in use. We installed a truss mounted rail system 
and designed telescoping mounts that could be switched between 
the normal listening position and a “parked” position where the 12 
small speakers are moved next to the existing towers. The mechanical
design took a long time and several prototypes were built and
tested. Our final system features custom fabricated mounts made 
from 80/20 extruded aluminum profiles and hanging steel channel to 
facilitate rolling the speakers between locations. 

5. DRIVING MANY SPEAKERS 

One of the difficult aspects of the design process was finding an audio 
routing and distribution technology that would allow us to satisfy all 
the requirements within a reasonable amount of time and with the 
limited budget and manpower available to us. Furthermore, the full 
system needed to be controlled from a computer running GNU/Linux 
(like our Listening Room system), and Linux desktops and laptops 
should be able to connect to it for diffusion tasks. 

For our GRAIL concert sound diffusion system we had been using a homebrew
system which consisted of one half of a network snake (the Mamba box),
plus some ingenious software in the form of a Jack [10] client
(jack-mamba [11]), to transform it into a very reliable 32 channel D/A
converter. While the system proved to be rock solid for our concerts, it
was not really expandable in a way which could satisfy our requirements.

The first audio technology we explored was MADI. We had used RME MADI
audio interfaces, which had good driver support in Linux, in our Listening
Room system. For this 22.4 system we had to use two cards, one RME MADI
and one RayDAT. This type of system could scale up to the number of inputs
and outputs that we needed, but we could not find an easy way to control
rerouting of connections to support both modes of operation. The only
reasonable cost option we found was an RME MADI switching matrix, but
switching between MADI scenes required several operations on the front
control panel, and there was no option for remote software control which
would have enabled us to design a separate simple to use interface.


Figure 5: Speaker mount


Our experience with the ethernet based Mamba digital snake system
suggested that a similar technology based on ethernet could be an answer
to meet our requirements.

There are several protocols that rely on ethernet connections to 
transport audio and interconnect several audio interfaces together. 
The most widespread commercially so far has been Dante, but that 
was ruled out as the protocol specification is closed and proprietary, 
and there is no formal support for Linux. There is one company 
that offers a 128 channel ethernet card with associated Linux binary 
drivers, but there is no guarantee that this will be supported for future 
kernel upgrades and the card and driver combo is extremely expensive.

AVB (Audio Video Bridging) [12], on the other hand, is an open standard
with a free software implementation embodied in the openAVNu project
[13]. Regretfully not many manufacturers have used this standard for
their products. One product manufacturer we considered was Motu, as their
newer audio interfaces can be connected to each other through AVB and
standard ethernet cables. Their internal configuration can be completely
controlled through a built-in web server which makes it platform
agnostic, and there is a published API that can use JSON HTTP requests
and OSC to remotely control all aspects of its operation. A Linux computer
could control the full system without relying on proprietary software.

Regretfully the AVNu project does not yet include code for a 
complete Linux-based solution. It would be possible to create one, 
but that would require a substantial software development effort which 
was beyond the scope of the resources available to this project. 

We bought a couple of interfaces for evaluation and experimented with
using their USB interfaces. In the most desirable MOTU cards we found that
the implementation of the USB2 class compliant driver was limited to 24
channels, which was much less than what we needed (the cards were
advertised as having 64 channel I/O through USB2, but that was only
possible when using their proprietary driver). So we were at an impasse.

5.1. Firmware giveth... 

Almost by chance we found an online reference to a “64 channel mode”, and
traced it back to a very recent firmware upgrade that added a mode
selection configuration option to the USB audio interface. The new
firmware allowed us to set the maximum number of channels handled by the
USB class compliant driver to 64 if the sampling rate was limited to 44.1
or 48 kHz, which was acceptable for our use case. This is beyond what the
USB2 specification can do, but it performed well in tests under Linux, and
allowed us to potentially address all speakers through the GRAIL control
computer’s USB2 interface, while multiple additional audio interfaces
could communicate audio data through AVB. This new feature would also
enable end users to interface with the finished diffusion system using
another audio interface with its own USB2 interface. This would provide
multiple entry points into the system using just USB2, making it easily
usable by our users.

A firmware upgrade transformed the Motu hardware into a viable option.
But what firmware can give, it can take away, as we will see...

5.2. Digital Mixer Mode 

The first phase of the design centered around finding a configuration 
that could keep the old setup of DM1000 plus 8 main speakers 
operational with minimal changes. Some simple tests determined 
that routing the DM1000 to a 16A Motu interface through ADAT 
so it would drive the speakers (instead of the DM1000 driving them 
directly) would not change the latency of the system significantly. 
This 16A audio interface would also be the word clock master for 
the whole system, and this basic setup would depend on only the 
DM1000 and that interface being up and running to work. 

This means that the 16.8 legacy system (we will call this the 
“Digital Mixer Mode”) could be kept unchanged, and could be a 
subset of the full 48.8 system (the “OpenMixer Mode”). 

5.3. Routing the Subwoofers 

There was a very long design detour that tried to use the internal 
crossover of the old subwoofers in “Digital Mixer Mode” as they 
were working fine and everybody wanted to keep their well known 
sound. We are going to skip those 4 months and jump straight into 
the design that incorporated new subwoofers much later. 

The subwoofer upgrade proved to be a problem, both from the 
point of view of signal routing and from the specs that they had to 
meet. We wanted to have standalone crossovers when in “Digital 
Mixer Mode”, and software crossovers implemented in the GRAIL 
control computer when in “OpenMixer Mode”. We also wanted to 
have a rather high crossover frequency (originally 110Hz, currently 
about 90Hz) to minimize the cone excursion of the main speakers at 
low frequencies (they are mid-field monitors and almost too small 
for the space, but we love their very precise sound). And we wanted 
a low frequency limit of around 20Hz with enough power to fill the 
room without clipping or distortion. 

The ideal subwoofer that would meet all our requirements does 
not exist (the details of why that is the case are beyond the scope of 


this paper). We ended up buying SVS SB4000 units, and not using 
the internal DSP processing included in the unit. 

The only workable solution we found was to use external programmable 
crossovers when the system was operating in “Digital 
Mixer Mode”. We used DBX 260 units and routed them through inputs 
and outputs of the same Motu audio interface used to drive the 
8 main speakers (this back and forth tour added a tiny bit of latency). 
In “Digital Mixer Mode” the DBX crossovers are inserted into the 
signal path by the internal routing of the Motu audio interfaces, and 
in “OpenMixer Mode” they are completely disconnected so that the 
GRAIL control computer can directly interface with speakers and 
subwoofers, and provide its own separate digital crossovers. In “Digital 
Mixer Mode” the signals going to the 8 main speakers are routed 
to the crossovers, which split them between the main speakers and 
the corresponding subwoofers. In “OpenMixer Mode” all speakers 
are mixed into the 8 subwoofers. All the signal switching is accomplished 
using the routing matrix that is part of the Motu audio 
interfaces. 

The use of external crossovers also allowed us to properly match 
phase at the crossover frequency and equalize the whole system in 
“Digital Mixer mode” for best performance, something we could not 
do before the upgrade. 

Another 16A Motu interface drives the upper 8 speakers with 
signals that are sent from the digital mixer through AVB and the 
internal routing matrices of both audio interface cards. 

The core system in “Digital Mixer Mode” consists of two Motu 
16A cards, the DBX crossover units and the DM1000 digital mixer. 
That not only keeps the same operational characteristics as before, 
but improves the system through better crossovers and speakers. 



Figure 6: Signal routing in Digital Mixer Mode 


5.4. And firmware taketh away... 

In the middle of the design and implementation of the system we 
found that newer Motu interfaces no longer had the 64 channel mode 
configuration option. It turns out that Motu had “unspecified problems” 
with it, and removed the feature from their products through 
another firmware upgrade. 

Suddenly the audio interfaces were useless for our purposes (24 
channels instead of 64), with no fix coming from Motu; after all, 
they worked fine with their proprietary drivers. To make a long story 
short, we were able to downgrade the firmware to a version where 
that feature was still supported, and everything worked again. This is 
a hack and not a very sustainable fix, as we (and possibly random users of 
the system) have to ignore the constant reminders that a “software 
upgrade is available”. 

While software upgradable products offer useful flexibility, you 
never know when something you depend on might go away, or some 
new and exciting capabilities might be added, and in which order 
that might happen. That was not the last problem we had with Motu 
firmware versions. 


5.5. OpenMixer Mode 

With the core architecture now a working reality, we added three 
24Ao Motu audio interfaces hidden in the ceiling trusses of the Stage 
to drive the additional 32 small speakers (two would have been enough, 
but using three made the wiring easier and wiring represented a large 
time expenditure in this upgrade). An additional 16A in the OpenMixer 
control computer rack (on casters) acted as the interface between 
the OpenMixer Linux control computer and the rest of the 
system, using a single USB2 interface. AVB streams are used to 
send and receive audio to and from all other Motu audio interfaces, 
and finally to all speakers. Changing the internal routing in the audio 
interfaces through JSON HTTP calls configures the audio routing for the 
two main modes of operation. 
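
As a rough illustration of what switching modes through such a JSON HTTP call could look like, the sketch below builds (but does not send) one request per mode. Everything specific in it is an assumption made for the example and is not taken from the paper: the device address, the `/datastore` endpoint, the form-encoded `json=` body and the routing keys and values are hypothetical and would have to be checked against the vendor's API documentation.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical routing tables: keys and values are placeholders, not real settings.
MODE_ROUTING = {
    "digital_mixer": {"ext/obank/0/ch/0/src": "1:0"},
    "openmixer": {"ext/obank/0/ch/0/src": "2:0"},
}


def build_routing_request(host, mode):
    """Build an HTTP request carrying the routing settings for one mode.

    The endpoint path and body encoding are assumptions for illustration.
    """
    settings = MODE_ROUTING[mode]
    body = urllib.parse.urlencode({"json": json.dumps(settings)}).encode("ascii")
    return urllib.request.Request(
        url=f"http://{host}/datastore",
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    request = build_routing_request("192.0.2.10", "openmixer")
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
    # Sending it would be: urllib.request.urlopen(request, timeout=2.0)
```

Keeping the per-mode settings in a plain table means the same small program can run on any machine that can reach the interfaces over the network.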

An additional 24Ai audio interface in the OpenMixer system 
rack is the entry point in the system for connecting laptops and other 
computers for concert diffusion or other purposes (Windows, OSX 
and Linux are all supported). A single USB2 cable allows us to have 
up to 64 channels of input/output available, which is enough for our 
current needs. AVB and the internal routing of the interfaces are used 
to send signals around. 

Yet another 16A audio interface is used to interface with our 
dedicated Linux desktop workstation which resides on another cart 
together with its display, keyboard and mouse. A total of 8 Motu 
audio interfaces interconnected through AVB make up the audio part 
of the diffusion system. 

Three Motu AVB switches connect all the audio interfaces together, 
and the different racks and mobile units in the space are easily 
connected through long ethernet cables (one mobile rack for the 
digital mixer and associated equipment, another for the OpenMixer 
control computer and another one for the desktop computer). The 
use of ethernet means there is a significantly smaller cable count to 
manage 64 channels of audio. 



[Figure 7 diagram labels: A3X speakers (three groups); 1-8: 8 subs; 9-16: 8 A77X; 8 DBX xovers (two groups); 8 upper speakers] 

Figure 7: Signal routing in OpenMixer Mode 


5.6. Switching modes 

The attentive reader might have noticed that switching between “Digital 
Mixer Mode” and “OpenMixer Mode” seems to be happening 
magically so far. While we do have a Linux control computer, we 
cannot rely on it for switching modes. The system should keep working 
even if the control computer is off, or if it breaks down. 

A solution that has worked admirably well is to add yet another 
computer (as if the system was not complex enough). This additional 
computer is a Raspberry Pi 3 with a touch panel, mounted right 
next to the digital mixer. It allows the user to switch sampling rates, 
switch between operating modes and even activate different options 
in “OpenMixer Mode” (changing between the Direct and Ambisonics 
modes, selecting Ambisonics decoders, etc). It communicates 
through ethernet with all the Motu audio interfaces and the main 
OpenMixer control computer. 

The OpenMixer control computer also has a touch display, and 
the software was designed so that either of them can be used to control 
the system and they stay synchronized with each other. 

5.7. What? More Speakers? 

Quite early in the implementation process Christopher Jette pushed 
for the immediate inclusion of something we had planned as a future 
expansion. In addition to the existing subwoofer and main speaker, 
the eight main towers would also house 8 speakers almost hugging the 
ground. These speakers were included to help “pull down” the sound 
image, especially in the Ambisonics decoder modes. So our final 
speaker count is 56 speakers and 8 subwoofers, adding up to 64 individual 
outputs. We are maxed out. 


5.8. Control Software 

In “OpenMixer Mode” the Linux control computer (currently booting 
Fedora and running an optimized RT patched kernel) performs all 
internal DSP using SuperCollider [14] and its Supernova multi-core 
load-balancing sound server [15]. Jconvolver [16] is used for very efficient 
low latency partitioned convolution, and implements the digital 
loudspeaker correction filters. The software itself is conceptually 
simple: it provides for level and delay equalization of all speakers, 
digital crossovers (a combination of Linkwitz-Riley [17] and Butterworth 
filters), routing control so that different sound sources (digital 
mixer, laptop, desktop) can be connected to the speakers, optional 
built-in Ambisonics decoders created with ADT [3][4] (up to 6th order) 
and of course digital equalization of all speakers with convolution 
filters created from analyzing their measured impulse responses 
with the DRC (Digital Room Correction [18]) software package. 
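
To illustrate why Linkwitz-Riley filters are a common choice for digital crossovers, here is a minimal sketch of a fourth-order Linkwitz-Riley pair evaluated in the analog domain. Each branch is modeled as two cascaded second-order Butterworth sections; the filter order and the 90 Hz crossover frequency used in the example are assumptions for illustration, not a description of the filters actually running in the system.

```python
import math


def butterworth2_lowpass(s):
    """Second-order Butterworth low-pass, frequency normalized to the crossover."""
    return 1.0 / (s * s + math.sqrt(2.0) * s + 1.0)


def butterworth2_highpass(s):
    """Second-order Butterworth high-pass, frequency normalized to the crossover."""
    return (s * s) / (s * s + math.sqrt(2.0) * s + 1.0)


def lr4_branches(freq_hz, crossover_hz):
    """Complex responses of the low and high branches of an LR4 crossover."""
    s = 1j * (freq_hz / crossover_hz)
    low = butterworth2_lowpass(s) ** 2    # two cascaded Butterworth sections
    high = butterworth2_highpass(s) ** 2
    return low, high


def to_db(magnitude):
    return 20.0 * math.log10(magnitude)


if __name__ == "__main__":
    crossover = 90.0  # Hz, assumed for the example
    for freq in (20.0, 45.0, 90.0, 180.0, 1000.0):
        low, high = lr4_branches(freq, crossover)
        print(f"{freq:7.1f} Hz  low {to_db(abs(low)):7.2f} dB  "
              f"high {to_db(abs(high)):7.2f} dB  "
              f"sum {to_db(abs(low + high)):5.2f} dB")
```

Both branches are about 6 dB down at the crossover frequency and their sum has unit magnitude at every frequency, so the subwoofer and main speaker outputs add up to a flat response when levels and delays are matched.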

SuperCollider is started automatically on boot through a systemd 
unit and takes care of orchestrating the rest of the system startup 
process. First, Jack [10] is started, then the SuperCollider program starts 
the Supernova sound server and its associated DSP software, two 
instances of Jconvolver, and finally everything is connected together 
using aj-snapshot and dynamically generated XML connection files. 
SuperCollider monitors all auxiliary programs, and restarts and 
reconnects them if they somehow fail. 

The whole system is optimized for low latency, and currently 
runs with 128 frames per period (work is underway to get it to work 
at 64 frames per period, which would start approaching the performance 
of the digital mixer which runs with 64 frame blocks). 
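
For reference, the duration of one period follows directly from the period size and the sampling rate. The short sketch below does that arithmetic; it only covers the length of a single period, and the total round-trip latency also depends on the number of periods buffered and on converter and convolution delays, which are not modeled here.

```python
def period_ms(frames_per_period, sample_rate_hz):
    """Duration of one audio period in milliseconds."""
    return 1000.0 * frames_per_period / sample_rate_hz


if __name__ == "__main__":
    for rate in (44100, 48000):
        for frames in (128, 64):
            print(f"{frames:4d} frames at {rate} Hz: "
                  f"{period_ms(frames, rate):.2f} ms per period")
```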

SuperCollider is also used for the touch graphical user interface 
in both the main computer and the small Raspberry Pi switching 
appliance. 

5.9. Calibration 

For best performance the full speaker array is calibrated after the 
initial installation and when hardware changes are made. First, 
elevation and azimuth angles for all speakers are measured, as well 
as the distances to the center of the space. These measurements are 
used to create the Ambisonics decoders for the main array and the 
subwoofers, and also to compensate for arrival times at the center 
of the space. After that we use Aliki [19] to measure the impulse 
response of the speakers, and that information is used to calculate 
convolution filters using DRC. Finally SPL measurements are done 
to compensate for small differences in speaker loudness in both 
direct and Ambisonics modes. 
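
The arrival time and loudness compensation steps can be sketched as follows. This is a simplified illustration under stated assumptions: a fixed speed of sound of 343 m/s, delays rounded to whole samples, and gain trims that attenuate every speaker down to the quietest one. The distances and SPL values are made-up example data, not measurements from the Stage.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed; varies with temperature


def alignment_delays(distances_m, sample_rate_hz):
    """Delay (in samples) for each speaker so that all arrive with the farthest one."""
    farthest = max(distances_m)
    return [
        round((farthest - d) / SPEED_OF_SOUND_M_S * sample_rate_hz)
        for d in distances_m
    ]


def gain_trims_db(measured_spl_db):
    """Attenuation (in dB, zero or negative) that matches each speaker to the quietest."""
    quietest = min(measured_spl_db)
    return [quietest - level for level in measured_spl_db]


if __name__ == "__main__":
    distances = [3.43, 5.0, 6.86]     # meters, example values only
    levels = [85.0, 86.5, 85.5]       # dB SPL, example values only
    print(alignment_delays(distances, 48000))
    print(gain_trims_db(levels))
```

Closer speakers receive longer delays and louder speakers receive more attenuation, so that all signals reach the center of the space aligned in time and level.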

6. PROBLEMS AND CHALLENGES 

While the selection of Motu products led to a viable design, there 
are still occasional problems when using them on “unsupported 
platforms”. 

Occasionally an audio interface can disconnect from one or more 
of its AVB streams. The web interface shows them blinking and we 
have not found a way out of this other than rebooting both interfaces. 
After the reboot the connections are re-established automatically. We 
have not been able to find a way to reproduce this, and it only happens 
in the more complex Stage system we are describing in this 
paper (it has not happened, so far, in a far simpler system now running 
in our Listening Room). We have to do a thorough audit of the 
existing streams and only enable exactly what we need. This may 
be a problem solved in later firmware releases, but we are chained to 
older ones to retain the features that make the system possible in the 
first place. 

In a different Studio in which we also deployed a single Motu 
interface we found another firmware related problem when using the 
class compliant driver under Linux. Suddenly inputs going into the 
computer through USB would switch channels in blocks of 8. What 
was coming through input 1 is suddenly in input 9, and so on and so 
forth. Again, downgrading to a previous firmware version fixes the 
problem (or using the proprietary driver). Caveat emptor. 

In terms of the Linux control computer for the Stage system, 
the long term solution for interfacing with the audio interfaces is to 
use AVB streams directly. That would lift the 64 channel limitation 
(we of course would like to add a few more speakers), and hopefully 
make the system more reliable. The foundation of that is available 
in the OpenAVNu git repository but much work remains to be done 
(some preliminary tests managed to sync the Linux computer to the 
AVB clock, and get the system to recognize the existence of a Motu 
card). 

6.1. Motu vs. Jack vs. PulseAudio 

A weird feature of the Motu interfaces is that every time the sampling 
rate is changed (even if it is an internal change and the card is not 
slaved to an external clock) it takes the card a few seconds to acquire 
a “lock”. During this time Jack can try to start, but at some point it 
decides that it can’t, and fails. 

This can lead to an endless loop of failed starts in the following 
scenario: assume the card is already running at 44.1 kHz and we 
are trying to start Jack at 48 kHz. Jack requests exclusive access to 
the card from PulseAudio and the request is granted. Jack tries to 
start but fails, because the card was running at 44.1 kHz and it takes 
time to switch to 48 kHz. After the attempt the card is switching to 
48 kHz, but when Jack quits it hands the card back to PulseAudio, 
which promptly resets its sampling rate to its default, 44.1 kHz. And 
we are back where we started. There is no way to start Jack, unless 
PulseAudio is killed or its default sampling rate is changed to the 
one we want, or we tell it to ignore the card, which is not what we 
want to do. 

If there is no change in sampling rate and Jack fails to start, 
waiting a few seconds and trying again succeeds. 

To avoid this problem, in the control software for both the Listening 
Room and Stage Linux computers we use a JSON HTTP call to 
check the lock status of the audio interface clock and delay the start 
of Jack until the sampling rate is locked. 
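
The wait-for-lock logic can be sketched as a simple polling loop. In this minimal example `read_lock_state` is a hypothetical callable standing in for the JSON HTTP query to the interface (its URL and response format are not given in the paper, so they are left out), and the timeout and polling interval are arbitrary values chosen for illustration.

```python
import time


def wait_for_clock_lock(read_lock_state, timeout_s=10.0, poll_s=0.25,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the interface reports a locked clock or the timeout expires.

    read_lock_state is a hypothetical callable returning True when locked.
    Returns True if the lock was seen, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if read_lock_state():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated interface that needs three polls before it reports a lock.
    answers = iter([False, False, True])
    if wait_for_clock_lock(lambda: next(answers), poll_s=0.01):
        print("clock locked, safe to start Jack")
    else:
        print("clock did not lock in time, not starting Jack")
```

Only after this function returns True would the control software go on to launch Jack, which avoids the failed start and the hand-back of the card to PulseAudio described above.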

7. CONCLUSIONS 

The opening concerts of the newly upgraded Stage took place on 
October 4 and 5, 2017, and the system performed very well (at the time we 
were still using the old subwoofers). Another round of upgrades in 
2018 replaced the original subwoofers with newer ones, as outlined 
above, and also upgraded the main 8 speakers with newer A77X 
Adam monitors. The lower layer of speakers was repositioned at 
the bottom of the main towers, and the new subwoofers were stacked 
immediately above them (originally they had been reversed). A second 
round of successful concerts (our annual Transitions concerts) 
took place in October 2018 with the fully upgraded array. The full 
array has seen more use in the past year, with several concerts using 
it instead of what would have been stereo or quad diffusion. 

We have outlined the design process of a complex Linux-based 
diffusion system, using off-the-shelf components and GNU/Linux 
for all the software components. 
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nical Director, also spent many hours helping with big and small 
details. Many students helped; in particular, thanks to Megan Jurek, 
who spent many hours soldering many, many small connectors, and 
routing what seemed like miles of cables. No audio would flow if 
not for her help. Jay Kadis, our audio engineer at the time, also 
spent quite a bit of time wiring DB25 connectors and cabling the 
main towers. Juan Sierra, one of our MA/MST students, was instrumental 
in properly phase matching the new subwoofers with the 
main speakers and tuning the crossovers for best performance; the 
Stage sounds much better thanks to him. Carlos Sanchez, sysadmin 
and staff at CCRMA, designed and implemented the hardware and 
software that drives the touch interface that controls the whole system. 
And Constantin Basica, our new concert coordinator, has been 
helping visiting artists use the full system for much more interesting 
concerts over the past year. Many thanks to all involved; we can now 
do justice to many fantastic pieces from composers that tickle our 
ears with beautiful sounds arranged in space. 

Figure 8: Transitions 2018 concert 

Figure 9: Transitions 2017 concert 
ABSTRACT 

We present the ability of our Spatial Audio Toolkit for Immersive 
Environments (SATIE) to simultaneously render real-time audio scenes 
composed with various spatialization methods. While object oriented 
audio and Ambisonics are already included in SATIE, we present a 
prototype of a directional reverberation method based on Impulse 
Response computation and describe how this method will be included 
in SATIE. 

1. INTRODUCTION 

A growing number of computer music performance venues are now 
equipped with large loudspeaker configurations [1], and therefore 
provide new opportunities for artists using 3D audio scene environments 
for composition and sound design. This, along with the recent 
rise of affordable spatial audio recording devices and increased interest 
in virtual reality experiences, gives rise to a growing need to combine 
multiple spatialization methods: captures (live or not) made 
in different ambisonic formats, mono object-based audio sources, as 
well as flexible and adaptable speaker configurations. We anticipate 
the evolution of spatial audio composition (targeting performing 
arts, installations or any other immersive experiences) involving 
different types of audio sources such as live audio capture, field 
recordings and synthetic audio, and where visual [2] and haptic [3] 
elements correlate with the audio part. 

Moreover, innovation from the game industry is pushing forward 
virtual and augmented realities, approaching spatial audio in an 
object oriented manner: sources are sound objects, located in space 
and controlled with low level parameters such as gain, equalizer and 
spread. This approach, although effective for speaker array systems, 
is missing architectural acoustical responses and adapts poorly to 
sound sources without a clear location, such as the sound of a river. The 
3D graphic world is now entering audio and provides methods for 
the simulation of sound based on physics of soft body vibration and 
sound propagation [4]. Although such simulations are probably hard 
to achieve in real-time, simulations of acoustic responses of 3D 
environments may significantly improve the coherence of the integration 
of audio sources with the virtual space, while still allowing real-time 
6-DoF navigation [5]. The use of ray tracing algorithms for 
real-time rendering is appropriate [6] and has the advantage of including 
the direction of the sound during auralization [7], allowing 
real-time calculation of directional sound reflections. 

One of the main challenges today for spatial audio rendering is to 
support the multiplicity of i) audio display methods, ii) spatial 
audio algorithms and iii) spatial audio authoring and 6-DoF navigation 
in spatial audio [8]. To date, however, many existing real-time 
3D audio scene rendering systems, such as COSM [9], BlenderCAVE [10], 
Spatium [11], Zirkonium [12], CLAM [13], 3Dj [14], 
Panoramix [15] and the spatDiff library [16] mostly focus on trajectory 
based composition with object oriented audio and sound fields 
with ambisonics. The challenge of navigating in heterogeneous spatial 
audio content is illustrated in Figure 1, where the spatial audio 
scene is constituted from 360° audio/video footage in which the sound 
field, captured using an ambisonic microphone¹, is mixed with synthetic 
audio that is spatialized through an object oriented approach and 
correlated with 3D objects on screen (the white bubbles coming out 
of the white vase). 

Figure 1: Example of an augmented reality application combining 
several spatialization algorithms (ambisonics and object 
oriented audio): a 360° audiovisual capture is rendered simultaneously 
with synthetic objects, the bubbles coming out of the white 
vase. 

In this paper, we present how our Spatial Audio Toolkit for Immersive 
Environments (SATIE²) addresses the challenge of combining several 
approaches to audio scene rendering, possibly rendering object based 
audio, ambisonic formats and architecture based acoustical 
spatialization simultaneously. 

2. SATIE 

The development of SATIE (with the SuperCollider language [17]) 
was first motivated by the need to render dense and rich audio scenes 
in the Satosphere, a large dome-shaped audiovisual projection space at 
the Society for Art and Technology [SAT] in Montreal, and to compose 
real-time audio/music scenes consisting of hundreds of simultaneous 
sources targeting loudspeaker configurations of 32 channels 
or more, and sometimes with two or more different audio display 
systems [18]. In fact, SATIE easily adapts to different audio display 
configurations and supports a plugin architecture, which makes it easily 
extensible to new situations. As such, it fills the role of a rapid 
prototyping tool for spatial audio composition. 

¹ The Zylia ZM-1 microphone. 

² https://gitlab.com/sat-metalab/satie, accessed Dec. 2018 

Figure 2: Example pipeline of a spatial rendering involving heterogeneous 
audio sources: reverberating sources, sound field sources 
and sound objects. 

Control of sound sources in SATIE is done through unified OSC [19] 
messages, allowing for lifecycle management of each sound source, along 
with control of (possibly custom) parameters. 
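
To make the OSC control path concrete, the sketch below encodes a single OSC 1.0 message by hand: a padded address string, a padded type tag string and big-endian arguments. The address and parameter names in the example are hypothetical placeholders and not SATIE's actual OSC namespace, which should be taken from the SATIE documentation.

```python
import struct


def _pad(data):
    """NUL-terminate and pad to a multiple of 4 bytes, as OSC strings require."""
    data += b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_message(address, *args):
    """Encode an OSC 1.0 message with int32, float32 and string arguments."""
    tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, float):
            tags += "f"
            payload += struct.pack(">f", arg)
        elif isinstance(arg, int):
            tags += "i"
            payload += struct.pack(">i", arg)
        elif isinstance(arg, str):
            tags += "s"
            payload += _pad(arg.encode("ascii"))
        else:
            raise TypeError(f"unsupported OSC argument: {arg!r}")
    return _pad(address.encode("ascii")) + _pad(tags.encode("ascii")) + payload


if __name__ == "__main__":
    # Hypothetical address and parameters, for illustration only.
    packet = osc_message("/example/source/set", "bubble1", "gainDB", -6.0)
    print(len(packet), packet)
    # The packet could then be sent over UDP with socket.sendto().
```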

3. RENDERING METHODS 

Facing a variety of approaches to composition with dense audio structures 
and a variety of audio displays, SATIE implements a flexible 
rendering pipeline that allows mixing of different audio input formats 
and multichannel mastering, and is easily adaptable to various audio 
displays. We rely mainly on SuperCollider’s supernova rendering 
engine for multi-threading operation. Consequently, we have access 
to parallel groups [20], which solve some real-time related issues with 
synth instantiation and bus allocation. SATIE structures different 
types of audio processors in layers, represented by a hierarchy of 
parallel groups (ParGroups): 

• audio sources 

• effects 

• post-processors. 

Audio sources are different types of mono or multichannel audio 
generators and players. On the second level are effects, which usually 
do not generate sound but modify the signal of audio sources. 
Finally, post-processors are meant as a mastering stage, where the final 
stages of DSP are done. In the actual implementation, the post-processors 
are divided into two groups: one for b-format signals and 
one for traditional mono/multichannel signals. 

The signals between audio sources and effects pass through busses, 
i.e. the user allocates auxiliary busses and manages the bus access 
on both the generator and the effect side. If any post-processors are 
present, all signals are collected there; otherwise, they pass directly 
to the spatializer. Multiple spatializers can be used, in which 
case SATIE will create the appropriate number of output channels. 

Figure 2 shows a rendering pipeline that combines object based 
audio sources, sound field sources and reverberating sources into a 
heterogeneous mix. 


3.1. Object Based Audio 

Object audio (Figure 3) is what is most commonly used in various 
entertainment industries, where a sound source has a clearly defined 
position within the coordinate system [21]. SATIE supports different 
types of object based audio sources, such as mono audio, mono live 
input sources and synthesized sounds [22]. The spatializers handling 
object audio expect azimuth, elevation and gain for panning each 
audio object. 
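
When source positions come from a 3D engine as Cartesian coordinates, they have to be converted into the azimuth, elevation and gain values that such spatializers expect. The sketch below shows one possible conversion. The axis convention (x forward, y left, z up, azimuth counterclockwise in degrees) and the inverse-distance gain law are assumptions made for the example, not SATIE's documented conventions.

```python
import math


def to_spherical(x, y, z):
    """Convert a position to (azimuth_deg, elevation_deg, distance).

    Assumed convention: x forward, y left, z up, azimuth counterclockwise.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / distance)) if distance > 0.0 else 0.0
    return azimuth, elevation, distance


def distance_gain_db(distance, reference=1.0):
    """Inverse-distance attenuation in dB, clamped to 0 dB inside the reference radius."""
    return -20.0 * math.log10(max(distance, reference) / reference)


if __name__ == "__main__":
    azimuth, elevation, distance = to_spherical(1.0, 1.0, 1.0)
    print(f"azimuth {azimuth:.1f} deg, elevation {elevation:.1f} deg, "
          f"gain {distance_gain_db(distance):.2f} dB")
```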

SATIE was initially designed to render large numbers of mono 
audio sources, optionally with effects, to large multi-channel 
loudspeaker systems. Audio sources and effects can be placed in groups 
and controlled either per group or on an individual basis. Similarly, 
spatializers take mono signals and place them on different channels 
according to azimuth, elevation and gain parameters. The post-processing 
audio object is comparable to mastering effects in a studio 
or live pipeline, typically limiting, compressing or normalizing 
signals. 

While all parameters (audio object specific as well as spatialization) 
can be modified directly from the SuperCollider language, 
SATIE also supports OSC, and our preferred method is to use a 3D 
engine for “volumetric” control of the sources as well as for the actual 
geometry computation. In line with this object based approach, and by 
load balancing the physical computation, we were able to use particle 
swarms of hundreds of simultaneous sound sources. 

3.2. Ambisonics 

The ambisonic pipeline, implemented via the SC-HOA plugins/quark³ 
(Figure 4(a)), provides means to play multichannel files and live audio 
inputs, encode mono signals into b-format signals and transcode between 
different ambisonic formats (ACN and FuMa). It supports b-format 
up to order 5. 
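
Two small pieces of arithmetic sit behind these statements: a full-sphere ambisonic stream of order N carries (N + 1)² channels, and the FuMa and ACN conventions order those channels differently. The sketch below shows the channel count, the ACN index formula and a first-order FuMa to ACN conversion. The SN3D normalization on the ACN side is an assumption (it is the usual pairing, as in the AmbiX convention), and the function is an illustration rather than the SC-HOA implementation.

```python
import math


def channel_count(order):
    """Number of channels of a full-sphere ambisonic stream of the given order."""
    return (order + 1) ** 2


def acn_index(degree, index):
    """Ambisonic Channel Number for spherical harmonic degree l and index m."""
    return degree * degree + degree + index


def fuma_to_acn_sn3d_first_order(w, x, y, z):
    """Reorder first-order FuMa (W, X, Y, Z) to ACN (W, Y, Z, X).

    FuMa attenuates W by 1/sqrt(2), so W is scaled back up for SN3D (assumed target).
    """
    return (w * math.sqrt(2.0), y, z, x)


if __name__ == "__main__":
    for order in range(1, 7):
        print(f"order {order}: {channel_count(order)} channels")
    print(fuma_to_acn_sn3d_first_order(0.5, 0.1, 0.2, 0.3))
```

An order 5 stream therefore needs 36 channels per sound field, which helps explain why decoding once at the post-processor stage, rather than per source, matters for performance.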

SATIE supports ambisonics with the same approach to signal 
path. The ambisonic audio input can be sent to ambisonic effects and 
post-processors such as rotation, mirroring, and beamforming filtering. 
The significant cost of ambisonic decoding is paid only once, 
since it is not embedded in each ambisonic source pipeline, but rather at 
the post-processor stage. We can also transcode between different 
ambisonic orders. 

³ https://github.com/florian-grond/SC-HOA 

Proceedings of the 17th Linux Audio Conference (LAC-19), CCRMA, Stanford University, USA, March 23-26, 2019 

Figure 4: SATIE pipeline involving ambisonics 

3.3. Reverberating Sources with Convolution Reverb 

Having various audio rendering methods driven by 3D engines opens 
the door to simulating acoustic spaces. Consequently, we 
have started developing a tool for real-time generation of impulse 
responses through ray tracing, with the idea of integrating the IR 
workflow with SATIE. Figure 5(a) shows a screenshot of a real-time 
rendered frame where the listener is facing a sound source represented 
by a cube at the end of the hallway. Figure 5(b) shows a wireframe 
view of a simple model (not related to the picture on the left) showing 
what is actually going on. The black dots on the inner faces of 
the model represent the impact points of the rays on the walls of a 3D 
model. Sound sources and the listener are not shown; the figure simply shows 
a point cloud mapped on the model for reference. This implementation 
uses another custom software, VARAYS⁴, which shares the 3D 
model with the 3D engine (in this case we are using ElS), receives the 
coordinates of the sound sources and the listener and writes IR files 
to disk. The IR files are read by SATIE, which continuously replaces 
the buffer read by SuperCollider’s PartConv UGen. A crude prototype 
of this process (using mono convolution) is demonstrated in the 
following video: https://vimeo.com/306202441. 

Besides mono IR, we can also generate Ambisonic IR (AIR), 
although at the time of writing, this process has not yet been 
integrated into SATIE. 
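
The general idea of turning traced ray paths into a mono impulse response can be sketched as follows. This is a deliberately simplified toy model and not the VARAYS algorithm: each path contributes a single impulse delayed by its length divided by the speed of sound, attenuated by an inverse-distance law and by a fixed, frequency-independent absorption per reflection. All constants and the example paths are assumptions.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed


def delay_in_samples(length_m, sample_rate_hz):
    """Propagation delay of a path of the given length, in whole samples."""
    return int(round(length_m / SPEED_OF_SOUND_M_S * sample_rate_hz))


def impulse_response(paths, sample_rate_hz, absorption=0.3):
    """Build a mono IR from (path_length_m, reflection_count) pairs.

    Toy model: one impulse per path, 1/distance attenuation, and a fixed
    energy loss per reflection (absorption is an assumed coefficient).
    """
    delays = [delay_in_samples(length, sample_rate_hz) for length, _ in paths]
    ir = [0.0] * (max(delays) + 1)
    for (length, reflections), delay in zip(paths, delays):
        amplitude = (1.0 - absorption) ** reflections / max(length, 1.0)
        ir[delay] += amplitude
    return ir


if __name__ == "__main__":
    # Direct sound plus two reflected paths, example values only.
    example_paths = [(3.43, 0), (6.86, 1), (10.29, 2)]
    ir = impulse_response(example_paths, 48000)
    print(len(ir), [(i, round(v, 4)) for i, v in enumerate(ir) if v != 0.0])
```

A response built this way could be written to disk and loaded into a partitioned convolution buffer, which is the role the IR files play in the workflow described above.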

4. CONCLUSION 

This paper outlined some of our approaches to heterogeneous audio 
scenes consisting of different types of audio input sources and 
multichannel displays. We described some SATIE functionalities with 
regard to heterogeneous spatial audio scenes. We have also described 
our approach to Ambisonic Impulse Response (AIR) in VARAYS in 
order to enable ambisonic acoustic simulation. VARAYS is still at 
a very early stage of development; it needs proper support for material 
based diffraction and diffusion. Figure 4(b) shows the general 
workflow, where AIR is applied to a mono sound source and is 
spatialized using the usual SATIE pipeline. There is still some work left 
to do in order to fully integrate vaRays into the SATIE pipeline (both 
IR and AIR). One of the areas to explore is the interpolation of 
IR instances in order to compensate for real-time changes in the 
listener and sound source locations. This process can be mixed with other 
types of rendering, which provides sufficient creative liberty to the 
user. There is also some work left on how IR and AIR are provided to 
SATIE, as file I/O is not optimal. We will be looking into sending 
OSC blobs. Another path would be sharing buffers between SATIE 
and vaRays using our shared memory library SHMDATA⁵. Another 
desired functionality is rendering VBAP spatialization into b-format 
signals. 

⁴ https://gitlab.com/sat-metalab/varays 
(a) The cube floating at the end of the hallway in the 18th century Paris 
model represents a sound object. The image is from a prototype developed 
by Metalab using ElS for visual rendering and navigation, VARAYS for 
real-time impulse response processing and SATIE for audio spatialisation. 

(b) Visualisation showing the impacts (black dots) of sound sources (not 
shown) on the walls of a 3D volume for a listener (not shown) placed inside 
the same volume. For this example, 2000 rays were thrown with a maximum 
of 3 reflections. A point cloud representing the impacts was saved by our 
software VARAYS and rendered in BLENDER. 

Figure 5: Example of directional reverberation approach with our prototype based on the joint use of SATIE and vaRays. We used the Bretez 
3D model of the 18th century. 
ABSTRACT 

This paper presents a real-time additive sound synthesis appli¬ 
cation with individual outputs for each partial and noise component. 
The synthesizer is programmed in C++, relying on the Jack API for 
audio connectivity with an OSC interface for control input. These 
features allow the individual spatialization of the partials and noise, 
referred to as spectro-spatial synthesis, in connection with an OSC 
capable spatial rendering software. Additive synthesis is performed 
in the time domain, using previously extracted partial trajectories 
from instrument recordings. Noise is synthesized using bark band 
energy trajectories. The sinusoidal data set for the synthesis is gen¬ 
erated from a custom violin sample library in advance. Spatialization 
is realized using established rendering software implementations on 
a dedicated server. Pure Data is used for processing control streams 
from an expressive musical interface and distributing it to synthe¬ 
sizer and renderer. 

1. INTRODUCTION 
1.1. Sinusoidal Modeling 

Additive synthesis is among the oldest digital sound creation methods 
and has been the foundation of early experiments by Max Mathews 
at Bell Labs. It allows the generation of sounds rich in timbre, 
by superimposing single sinusoidal components, referred to as partials, 
either in the time or frequency domain. Based on the Fourier 
principle, any quasi-periodic signal $y(t)$ can be expressed as a sum 
of $N_{part}$ sinusoids with varying amplitudes $a_n(t)$ and frequencies 
$\omega_n(t)$ and an individual phase offset $\varphi_n$: 

$$y(t) = \sum_{n=1}^{N_{part}} a_n(t) \sin(\omega_n(t)\, t + \varphi_n) \qquad (1)$$ 

In the harmonic case, which applies to the majority of musical
instrument sounds, the partial frequencies can be approximated as
integer multiples of the fundamental frequency $f_0$:

$$y(t) = \sum_{n=1}^{N_{part}} a_n(t) \sin(2 \pi n f_0(t)\, t + \varphi_n) \qquad (2)$$

Although relative phase fluctuations are important for perception [1],
the original phase can be ignored in many cases, which is of benefit for
manipulations of the modeled sound:

$$y(t) = \sum_{n=1}^{N_{part}} a_n(t) \sin(2 \pi n f_0(t)\, t) \qquad (3)$$
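As a minimal illustration of Eq. (3), the following Python sketch sums harmonic partials for the simplest case of a constant fundamental frequency and constant partial amplitudes. The function name `harmonic_additive` and the example values are hypothetical and not taken from the presented application; time-varying trajectories are deliberately left out.

```python
import math

def harmonic_additive(f0, amplitudes, sample_rate, num_samples):
    """Evaluate Eq. (3) for a constant f0 and constant partial amplitudes."""
    out = []
    for k in range(num_samples):
        t = k / sample_rate
        # Sum of harmonics n = 1..N_part at integer multiples of f0.
        y = sum(a * math.sin(2.0 * math.pi * n * f0 * t)
                for n, a in enumerate(amplitudes, start=1))
        out.append(y)
    return out

# Three partials of a 440 Hz tone, 10 ms at 48 kHz (illustrative values).
signal = harmonic_additive(f0=440.0, amplitudes=[1.0, 0.5, 0.25],
                           sample_rate=48000, num_samples=480)
```

Since every partial is a bounded sinusoid, the magnitude of the result never exceeds the sum of the partial amplitudes.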

Based on this theory, an algorithm for speech synthesis has been
proposed by McAulay et al. [2]. For musical sound synthesis, a noise
component has been added to the algorithm [3], resulting in the
sinusoids+noise model. The signal is then modeled as the sum of the
deterministic part $x_{det}$ and the stochastic part $x_{stoch}$, also
referred to as the residual:

$$x = x_{det} + x_{stoch} \qquad (4)$$

Modeling of residuals can, for example, be performed by approximating
the spectral envelope using linear predictive coding [3] or a filter
bank based on Bark frequencies [4]. The phase of the stochastic signal
is random, in theory, and thus need not be modeled. However, residuals
usually are not completely random since they still contain information
from the removed harmonic content.

In order to fully model the sounds of arbitrary musical instruments, a
transient component $x_{trans}$ is included [4] in the full signal
model. This component captures plucking sounds and other percussive
elements:

$$x = x_{det} + x_{stoch} + x_{trans} \qquad (5)$$

Since the work presented in this paper focuses on the violin in 
legato techniques, the transient component can be neglected without 
impairing the perceived quality of a re-synthesis. 

1.2. Spectral Spatialization 

In electronic and electroacoustic music, the term spectral
spatialization refers to the individual treatment of a sound’s frequency
components for a distribution on sound reproduction systems [5]. Timbral
sound qualities can thus be linked to the spatial image of the sound,
even for pre-existing or fixed sound material. In the case of
spectro-spatial synthesis, this process is integrated on the synthesis
level, for example in additive approaches. This is not yet a common
feature in available synthesizers, but several research projects have
been investigating the possibilities of such approaches with
applications in musical sound processing, sound design, virtual
acoustics and psychoacoustics.

Topper et al. [6] apply additive synthesis of basic waveforms (square
wave, sawtooth), physical modeling and sub-band decomposition in a
multichannel panning system with real-time, prerecorded and graphic
control. Their system is implemented in MAX/MSP and RTcmix, running on
both Mac and PC/Linux hardware with a total of 8 audio channels.

Verron et al. [7] use the sinusoids + noise model for spectral
spatialization of environmental sounds. Each component can be
synthesized with an individual position in space on Ambisonics and
binaural systems. Deterministic and stochastic components are composed
and added together in the frequency domain and subsequently spatially
encoded with a filterbank. Control over the synthesis process depends on
the nature of the environmental sounds [8].

In the context of electroacoustic music, James [9] expands Dennis
Smalley’s concept of spectromorphology to the idea of spatiomorphology.
Timbre spatialization is achieved using terrain surfaces and by mapping
these to spatio-spectral distributions. Max/MSP is used for computing
the contribution of spectral content to individual speakers with
distance-based amplitude panning (DBAP) and Ambisonic equivalent panning
(AEP) methods.

Figure 1: Partial amplitude trajectories of a violin sound

Figure 2: Partial frequency trajectories of a violin sound

Figure 3: Unwrapped partial phases of a violin sound

Figure 4: Bark band energy trajectories of a violin sound

Spectral spatialization can also be used to synthesize dynamic
directivity patterns of musical instruments in virtual acoustic
environments. Since the directivity in combination with movement has a
significant influence on an instrument’s sound, this can increase the
plausibility. Warusfel et al. [10] use a tower with three cubes, each
containing multiple speakers, to spatialize frequency bands of an input
signal for the simulation of radiation patterns.

1.3. The Presented Application 

The presented application incorporates different synthesis modes, of
which only the so-called deterministic mode will be the subject of this
paper. In this basic mode, precalculated parameter trajectories, as
presented in Sec. 2, are used for a manipulable resynthesis of the
original instrument sounds.

The software architecture is designed to allow the use of additive
synthesis, respectively of sinusoidal modeling, on sound field synthesis
systems or other reproduction setups. This is achieved by providing
individual outputs for all partials and noise bands in an application
implemented as a JACK client, described in Sec. 3. Using JACK allows the
connection of all individual synthesizer output channels to a
JACK-capable renderer, such as the SoundScape Renderer (SSR) [11],
Panoramix [12] or the HOA Library [13]. By making each partial a single
virtual sound source in combination with this rendering software, the
spatial distribution of the synthesis can be modulated in real time.
Pure Data [14] is used to receive control data from gestural interfaces
or to play back predefined trajectories for generating control streams
for both the synthesizer and the spatialization renderer. A direct
linkage between timbre and spatialization is thus created, which is
considered essential for a meaningful spectro-spatial synthesis.

2. ANALYSIS 

The TU-Note Violin Sample Library [15], [16] is used as audio content
for generating the sinusoidal model. Designed in the style of classic
sample libraries, this data set contains single sounds of a violin in
different pitches and intensities, recorded at an audio sampling rate of
96 kHz with 24-bit resolution.

Figure 5: Sequence diagram for the JACK callback function

Analysis and modeling are performed beforehand in Matlab, using
monophonic pitch tracking and subsequent extraction of the partial
trajectories by peak picking in the spectrogram. YIN [17] and SWIPE [18]
are used as monophonic pitch tracking algorithms. Based on the f0
trajectories, partial tracking is performed with the STFT, applying a
hop size of 256 samples (2.7 ms) and a window size of 4096 samples,
zero-padded to 8192 samples. Quadratic interpolation (QIFFT), as
presented by Smith et al. [19], is applied for peak parameter estimation
of up to 80 partials. Due to the sampling frequency, the full number of
partials is only analyzable up to the note D5 (576.65 Hz).
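The figures quoted for the analysis stage can be checked with a few lines of arithmetic. The sketch below assumes that the limiting factor for the highest fully analyzable note is the Nyquist frequency of the 96 kHz recordings, and that the next note up is one equal-tempered semitone above D5; both are assumptions made for illustration, and the variable names are hypothetical.

```python
SAMPLE_RATE = 96000    # Hz, sampling rate of the sample library
HOP_SIZE = 256         # samples
NUM_PARTIALS = 80      # maximum number of analyzed partials
D5 = 576.65            # Hz, as quoted in the text

hop_ms = 1000.0 * HOP_SIZE / SAMPLE_RATE        # hop size in milliseconds
nyquist = SAMPLE_RATE / 2.0                     # highest representable frequency
max_f0 = nyquist / NUM_PARTIALS                 # highest f0 with all 80 partials
next_semitone = D5 * 2.0 ** (1.0 / 12.0)        # assumed equal-tempered step

print(f"hop size: {hop_ms:.2f} ms")
print(f"highest f0 with {NUM_PARTIALS} partials: {max_f0:.1f} Hz")
print(f"80th partial of D5: {NUM_PARTIALS * D5:.0f} Hz")
print(f"80th partial one semitone higher: {NUM_PARTIALS * next_semitone:.0f} Hz")
```

Under these assumptions the 80th partial of D5 still lies below 48 kHz, while that of the next semitone does not, which is consistent with D5 being the stated limit.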

By subtracting the deterministic part from the complete sound in the
time domain, the residual signal is obtained. The residual is then
filtered using a Bark scale filterbank with second-order Chebyshev
bandpasses, and the temporal energy trajectories are calculated for the
resulting 24 band-limited signals. At this point, a large amount of
information is removed from the residual signal. Due to the shortcomings
of the time domain subtraction method, the residual still contains
information from the deterministic component. By averaging the energy
over the Bark bands, this relation is eliminated.

Results of the analysis stage are the trajectories of the partial
amplitudes, as shown in Figure 1, the trajectories of partial
frequencies and phases, as shown in Figure 2 and Figure 3 respectively,
as well as the trajectories of the Bark band energies, illustrated in
Figure 4. The resulting data is exported to individual YAML files for
each sound, which can be read by the synthesis system.

3. SYNTHESIS SYSTEM 

3.1. Libraries 

The synthesis application is designed as standalone Linux command-line
software. The main functionality of the synthesis system relies on the
JACK¹ API for audio connectivity and on liblo², respectively the liblo
C++ wrapper, for receiving control signals. libyaml-cpp³ is used for
reading the data of the modeled sounds and the relevant configuration
files. libsndfile⁴, for reading the original sound files, and libfftw⁵
are included but not relevant for the aspects presented in this paper.
Frequency domain synthesis and sample playback are partially implemented
but not used at this point.

3.2. Algorithm 

Both the sinusoidal and the noise component are synthesized in the time
domain, using a non-overlapping method. For the sinusoidal component,
either the built-in sin() function of the cmath library or a custom
lookup table can be selected. The choice does not significantly affect
the overall performance. The filter bank for the noise synthesis
consists of 24 second-order Chebyshev bandpass filters with fixed
coefficients, calculated before runtime. The amplitude of each frequency
band is driven by the previously analyzed energy trajectories.

During synthesis, the algorithm reads a new set of support points from
the model data for each audio buffer and increments the position within
the played note. Figure 5 shows a sequence diagram for the deterministic
synthesis algorithm, starting at the JACK callback function, which is
executed for each buffer of the JACK audio server. Since the synth is
designed to enable polyphonic play, the voice manager object handles
incoming OSC messages in the function update_voices() to activate or
deactivate single voices.

¹ http://jackaudio.org/

² https://github.com/radarsat1/liblo

³ https://github.com/jbeder/yaml-cpp/

⁴ http://www.mega-nerd.com/libsndfile/

⁵ http://www.fftw.org/
Figure 6: Combination of synthesizer and renderer on separate machines
using Pure Data for synth configuration and parameter parsing


For the synthesis of mostly monophonic, excitation-continuous
instruments like the violin, the polyphony merely handles the
overlapping of released notes. Subsequently, the voice manager loops
over all active voices in the function getNextFrame_TD(), first setting
the new control parameters for each voice.

In cycle_start_deterministic(), support points for the parameters of
all partials are picked at the relevant voice’s playback position. These
support points are then linearly interpolated over the buffer length in
set_interpolator().

Finally, in getNextBlock_TD(), each single voice generates the output
for all sinusoids and all noise bands in two separate vectorizable
loops, adding both to the output buffer.
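The per-buffer procedure described above can be sketched in Python as follows. This is a simplified illustration and not the C++ implementation: the function `synthesize_block` and its arguments are hypothetical, only the sinusoidal part is shown, and the use of a running phase accumulator to keep consecutive buffers continuous is an assumption that the text does not confirm.

```python
import math

def synthesize_block(amp_start, amp_end, freq_start, freq_end,
                     phases, sample_rate, block_size):
    """Render one buffer and return one output list per partial.

    amp_*/freq_* hold the support points at the start and end of the
    buffer. phases is updated in place so that the next call continues
    each sinusoid without a phase jump (assumed design, see lead-in).
    """
    outputs = []
    for p in range(len(phases)):
        out = []
        for k in range(block_size):
            frac = k / block_size
            # Linear interpolation of the support points over the buffer.
            amp = amp_start[p] + frac * (amp_end[p] - amp_start[p])
            freq = freq_start[p] + frac * (freq_end[p] - freq_start[p])
            out.append(amp * math.sin(phases[p]))
            phases[p] += 2.0 * math.pi * freq / sample_rate
        phases[p] = math.fmod(phases[p], 2.0 * math.pi)
        outputs.append(out)   # individual output channel for this partial
    return outputs

# Two partials, one 128-sample buffer at 48 kHz (illustrative values).
partial_phases = [0.0, 0.0]
channels = synthesize_block([1.0, 0.5], [0.5, 0.25],
                            [440.0, 880.0], [445.0, 890.0],
                            partial_phases, 48000, 128)
```

Keeping one output list per partial mirrors the idea of individual outputs: each list could be routed to its own channel instead of being summed.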

3.3. Runtime Environment and Periphery 

The runtime system for the synthesis starts a JACK server with a
sampling rate of 48 kHz, a buffer size of 128 samples and 2 periods per
buffer. This results in a latency of 5.3 ms for the audio playback,
which is within the limits for this synthesis approach. On an Intel(R)
Core(TM) i7-5500U CPU @ 2.40GHz with disabled speed-stepping and a
Fireface UFX, the JACK server shows an average load of approximately
20%.
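The quoted latency follows directly from the server settings. A short check, using a hypothetical helper `jack_latency_ms` that multiplies buffer size by the number of periods and divides by the sampling rate:

```python
def jack_latency_ms(buffer_size, periods, sample_rate):
    """Nominal playback latency in milliseconds for the given settings."""
    return 1000.0 * buffer_size * periods / sample_rate

latency = jack_latency_ms(buffer_size=128, periods=2, sample_rate=48000)
print(f"{latency:.1f} ms")  # prints "5.3 ms"
```

This covers only the buffering of the audio server; converter and network delays of the full setup are not included.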

The interaction of the involved software components is visualized in
Figure 6. For reasons of performance and increased flexibility in the
studio, two separate machines are used for synthesis and spatialization.
Connectivity between the systems is realized with MADI or DANTE, using
individual channels for the 80 partials and 24 noise bands.


3.4. Control 



Figure 7: Spatialization scene in a 2D setup with 30 partials and their 
positions 

The control data for the partial positions in the rendering software is
not generated in the synthesis system at this point and is managed
externally. This offers more flexibility for testing different mappings
at this stage of development. A Pure Data patch is used to receive
incoming control messages, either from OSC or MIDI, and distribute them
to the synthesizer and the spatialization software. For live
performance, the patch receives continuous control streams for pitch and
intensity from an improved version of the interface presented by von
Coler et al. [20] and visualizes the sensor data. Pitch and intensity
are forwarded directly to the synth. Additionally, data from several
Force Sensitive Resistors (FSR) and a 9 degrees of freedom IMU, which
can be used for controlling the spatialization, is sent to the patch.

Figure 7 shows an example for a simple spatialization mapping on a 2D
system. The absolute orientation of the IMU is used to control the
general direction φ of the partial flock. A second parameter S, derived
from the intensity and additional sensor data, controls the spread of
the partials around this angle, depending on the partial index.
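The text does not give the exact mapping formula, so the following Python sketch shows only one possible way to turn a direction φ and a spread S into per-partial azimuths. The rule used here (the first partial sits at φ, higher partials fan out alternately to either side in proportion to their index) is purely hypothetical, as is the function name `partial_azimuths`.

```python
import math

def partial_azimuths(phi, spread, num_partials):
    """Hypothetical mapping from direction and spread to partial azimuths.

    Partial 1 is placed at phi; higher partials move away from phi,
    alternating sides, scaled by spread and by the partial index.
    Angles are in radians and wrapped to the range [-pi, pi).
    """
    azimuths = []
    for n in range(1, num_partials + 1):
        side = 1.0 if n % 2 == 0 else -1.0
        offset = side * spread * (n - 1) / max(num_partials - 1, 1)
        angle = (phi + offset + math.pi) % (2.0 * math.pi) - math.pi
        azimuths.append(angle)
    return azimuths

# 30 partials spread over at most +/- 90 degrees around the front.
flock = partial_azimuths(phi=0.0, spread=math.pi / 2.0, num_partials=30)
```

With a spread of zero all partials collapse onto the direction φ, which corresponds to a single point-like source.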

4. CONCLUSION 

After significantly improving the performance of the synthesis system,
the application can now be used with the full 80 partials and 24 Bark
bands as individual outputs. Recent tests in combination with different
spatial rendering software and different loudspeaker setups show
promising results. However, the dynamic spatialization of such a number
of virtual sound sources and the resulting traffic of OSC messages is
demanding for the runtime system. Using separate machines for synthesis
and rendering reduces the individual load. The number of rendering
inputs can also be reduced without limiting the perceived quality of the
spatialization, since multiple partials may share one virtual sound
source.
Next steps are now possible, which include the empirical investigation
of mappings from controller sensors to both the spectral and spatial
sound properties. This includes user experiments to evaluate different
mapping and control paradigms, as well as perceptual measurements of the
synthesis results.
ABSTRACT

Digital waveguides and highly-resonant filters are potential
fundamental building blocks of physical models and modal processors.
What might a sound designer accomplish with a massive collection of
these objects?

When the building blocks are independent, the overall system becomes
highly parallel. We investigate the feasibility of using a modern
Graphics Processing Unit (GPU) to run collections of waveguides and
filters, toward constructing realtime collections of room simulations or
instruments made up of many banded waveguides.

These two subproblems offer different challenges and bottlenecks in GPU
acceleration: one is compute-bound while the other has memory
optimization challenges.

We find that modern GPUs can run these algorithms at audio rates in a
straightforward fashion, that is, sample-by-sample without needing to
implement transforms that allow computation of subsequent time samples
concurrently. While a fully-realized instrument or effect based on these
building blocks requires additional processing and will have more data
dependencies that reduce parallelism, we find that
consumer-GPU-accelerated audio enables a scale of realtime models which
would be intractable on contemporary consumer CPUs.

1. INTRODUCTION 

Potential applications for a large number of modal filters or digital 
waveguides include: 

• A large collection of coupled acoustic spaces, for example an
opera house with listening booths that may be seen as resonators,
or the interior architecture of ancient Chavín [1].

• A virtual orchestra where we have many players, each using 
an instrument made up of several digital waveguides. 

• A virtual reality simulation where a server may track room- 
and position-dependent modal reverberators for a number of 
participants on low-power client devices. 

• A drum set made up of a couple dozen individual instruments, 
each using many modal filters. 

While the first three ideas are hypotheticals enabled by having access
to massively parallel filtering/waveguide systems, the fourth exists as
a real-world proof of concept that synthesizes a dozen modal cymbal
models at realtime rates using a GPU. Ongoing work aims at adding
realtime controls for a performer.

1.1. Building Blocks: Modal Synthesis and Digital Waveguides 

Modal synthesis involves determining the natural resonant modes of a
vibrating object, and using the appropriate frequencies, amplitudes, and
decay rates to build a system that simulates the original sound.

A filter bank of high-Q filters is often used for such sound synthesis,
and is also the backbone of modal reverberators [2]. The more modes we
can compute at realtime audio rates, the higher the fidelity of the
sound, and the more sources or rooms we may model.

Digital waveguides [3] are efficient for simulation of traveling waves,
and with scattering junctions and nonlinearities added, a wide range of
physically-accurate bowed strings, brass, etc. may be simulated with
robust realtime performance controls. Here, we are interested in working
toward many virtual performers each playing an independent instrument
(orchestra), or one performer given control over simultaneous but mostly
independent “clusters” of waveguide-powered instruments, such as a
virtual drum set with a large number of pieces.

Digital waveguides may be implemented efficiently in the 1-dimensional
case via a bidirectional delay line representing two traveling waves,
plus filters to account for dispersion loss. These are the basic
building blocks we seek to accelerate, noting that for more complex
physical models we will add scattering junctions, additional filtering,
and nonlinear elements incurring additional computation cost. In some
cases, such as piano string modeling [4], some terms may be commuted, or
combined with an impulse response, to add complexity to the overall
model without scaling the overall steady-state computational cost.

1.2. GPU Acceleration for Audio Algorithms 

For years, graphics processing units (GPUs) have supported both
high-level realtime graphics APIs as well as lower-level,
general-purpose computational APIs. GPU acceleration of audio synthesis
and audio effect algorithms has been shown to yield substantial speedups
on certain algorithms. GPUs advance in performance each generation in
terms of parallel core count and base core speed, so we expect some
previously intractable problems to become tractable over time.

Among papers in the literature: 

Savioja et al. [5] give an overview of potential audio tasks that may be
accelerated via GPGPU programming at audio rate and reasonable buffer
sizes for realtime performance. Sinusoid-based additive synthesis
obtained a 250x+ speedup over CPU implementations. FFTs running on a GPU
were able to be eight times as long as those running on a CPU-based
implementation, and FIR filters were able to be 130 times as long as
their CPU counterparts. In [5] and [6], the authors showed it was
possible to synthesize 1.9 million sinusoids in realtime, a 1300x
speedup over a serial lookup table computation on one CPU. This was on a
GPU that is six generations behind ours and three major GeForce
architecture revisions behind our card¹. We also note that our graphics
card is itself now a generation and a major architecture advance behind
the times. This work results in a realtime sound canvas with more
“paint” than previously available (how might an artist use 1.9 million
partials in additive synthesis?). The million sinusoids example also
demonstrates that maximum performance requires tuning and knowledge of
the specific underlying hardware.

¹ Fermi (GTX 480, 2010) -> Kepler -> Maxwell -> Pascal (GTX 1080Ti,
2017); RTX cards released in 2018 use the Turing architecture.

Trebien et al. [7] use modal synthesis to produce realistic sounds for
realtime collisions between objects of different materials. Noting that
IIR filters do not traditionally perform well on GPUs, due to dependence
on prior state not mapping well to the parallel nature of GPUs, they
introduce a transform to change this into a linear convolution operator
and to unlock time-axis parallelism.

Belloch et al. [8] accelerate IIR filters on the GPU directly by using
the Parallel IIR representation. They achieve 1256 concurrent
256th-order IIR filters at audio rates and sub-millisecond latency at a
44.1 kHz sampling rate.

Subsequently, Belloch covers GPU-accelerated massively parallel
filtering in [9], and Belloch et al. [10] leverage GPU acceleration to
implement Wave Field Synthesis on a 96-speaker array, with nearly ten
thousand fractional-delay room filters with thousands of taps. The
maximum number of simulated sound sources is computed for different
realtime buffer sizes and space partitions; with a 256-sample buffer
(5.8 ms at 44.1 kHz), between 18 and 198 real-time sources could be
placed in the field.

Bilbao and Webb [11] present a GPU-accelerated model of timpani,
synthesizing sound at 44.1 kHz in a 3D computational space within and
outside the drum. The GPU approach uses a matrix-free implementation to
obtain a 30x+ speedup over a MATLAB CPU-based, sparse-matrix prototype,
and a greater-than-7.5x speedup over a single-threaded C code baseline.
The largest (and most computationally expensive) drum update equation is
optimized to 2.04 milliseconds per sample, where the bottleneck is a
linear system update for the drum membrane.

Our area of study utilizes recursive filters, and unfortunately the
optimizations of the million-sinusoids and Parallel IIR filter works do
not apply directly; we would like to be able to adjust parameters
arbitrarily in realtime and at sample rate, which would require
rerunning transformation code too often.

Still, our filter bank is expected to be highly parallel in terms of
independence between the filters. We may have coupling between modes,
but so long as it’s limited, we can implement this in a way that is
compatible with GPU programming ideas. We also do not need to implement
arbitrary IIR filters as in Belloch et al., but will be able to use
special-purpose damped oscillation filters that only require a
first-order complex update equation (see Section 2).

If GPUs have advanced enough in terms of increased clock rate, 
increased floating-point resources, and lower memory latency in the 
last few generations, we aim to compute filter and physical model 
updates sample-by-sample in realtime. 

1.3. GPU Programming 

Next, we present a brief overview of GPU programming, and note 
advantages and challenges versus programming for a general-purpose 
processor. 

Various toolkits exist to develop GPU programs: two of the biggest are
NVIDIA's CUDA for use with their graphics cards, and APIs implementing
OpenCL, a more general heterogeneous computational framework. For the
following investigation we use CUDA.

If readers have any modern NVIDIA card, they may download the 
software developer kit at developer.nvidia.com. 


When starting to port an algorithm to the GPU, we must consider if it
has parallelism to leverage. NVIDIA coined the term “single instruction,
multiple thread” (SIMT) as a variation on the “single instruction,
multiple data” (SIMD) of vector processors and modern mainstream
processors. If our work is a series of several different and dependent
computations, we may not be able to achieve a speedup. If we can
structure it as applying identical operations to many points, it is a
good candidate for acceleration.

The core work unit in CUDA is a group of 32 threads, called a warp.
Each thread in a warp may have its own values for local variables, but
all threads in a warp will always run the same instruction
simultaneously.

A warp is executed on a Streaming Multiprocessor (SM). Different
graphics cards have different numbers of SMs; a low-power embedded
device may have two while our graphics card used for the trials below
has 28.

A trivial example task would be to take N integer inputs and 
double them. 

There are two main steps involved in this task. First, we write a 
kernel, the code that will run on the GPU. This will accept an array 
of inputs; each thread will index into the array, find the element it is 
to double, multiply it by 2, and store it in an output array. Second, we 
write host (CPU) code that calls a CUDA function to send an input 
array to the GPU, execute the kernel, wait for the kernel to complete, 
and finally copy the output values back to the CPU, for example so 
we can save them to disk. 

If we have 32 inputs to double, CUDA will execute our kernel 
code on one warp of 32 threads. All 32 threads in that warp execute 
in lockstep and run the same instructions, but obtain a different value 
of the array to double and a different output location to store the 
result. If we have only 15 inputs to double, this is not a problem. 
We will still run on one warp, and the 17 threads without any work 
to do effectively get a break (they technically are issued instructions 
but do not write to memory or compete for resources). If we have 
33 inputs to double, we outgrow one warp. Threads will be grouped 
in one warp of 32 threads and a second warp of one solitary thread. 
More than one warp can run at a time, so it is very likely we will run 
in the same time as it took to run the 32 and 15 input cases. 
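The thread-to-warp bookkeeping described above can be mimicked in plain Python. The sketch below is not CUDA and does not model lockstep execution; it only emulates how a launch is rounded up to whole warps of 32 threads and how threads past the end of the array skip their work. The function name `launch_double_kernel` is hypothetical.

```python
WARP_SIZE = 32

def launch_double_kernel(inputs):
    """Emulate the doubling kernel with one logical thread per slot.

    The thread count is rounded up to whole warps; threads whose index
    lies beyond the input array do nothing, as a bounds check in a real
    kernel would ensure.
    """
    n = len(inputs)
    num_warps = (n + WARP_SIZE - 1) // WARP_SIZE    # ceiling division
    outputs = [0] * n
    idle_threads = 0
    for thread_id in range(num_warps * WARP_SIZE):
        if thread_id < n:
            outputs[thread_id] = 2 * inputs[thread_id]
        else:
            idle_threads += 1                        # no element to process
    return outputs, num_warps, idle_threads

for n in (32, 15, 33):
    _, warps, idle = launch_double_kernel(list(range(n)))
    print(f"{n} inputs: {warps} warp(s), {idle} idle thread(s)")
```

For 32, 15 and 33 inputs this reports one warp with no idle threads, one warp with 17 idle threads, and two warps with 31 idle threads, matching the three cases discussed in the text.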

A CUDA-enabled graphics card has some number of Streaming
Multiprocessors (SMs). The product specifications for individual
graphics card models and a capabilities table such as the one provided
in the CUDA Programming Guide let authors know how many threads may be
in flight per SM, and how many may actually get run each cycle.

For example, our graphics card may have up to 64 warps assigned to each
SM (64 warps × 32 threads/warp = 2048 threads), though only 4 warps (128
threads) may be scheduled on each single clock cycle.

The programming guide lists other bottlenecks and numbers to 
consider. One piece of information very relevant to us is the number 
of simultaneous arithmetic operations available. 

The graphics card we use is a consumer card meant for gaming. Some
other cards (the NVIDIA TITAN for example) are targeted more for
enterprise and scientific computing uses, albeit at a significantly
higher price point.

We note the issue rate of floating-point operations from the
programming guide:

Table 1: Throughput of FP instructions (results per clock per SM)

                      Consumer    Enterprise
16-bit mult+add           2          128
32-bit mult+add         128           64
64-bit mult+add           4           32

This means that if we require 64-bit precision, our consumer card is
more likely to be bottlenecked by this figure than the enterprise card
in the lineup.

On the other hand, we note that our card has higher per-clock 32-bit
floating-point throughput. Assuming that is sufficient precision for a
problem, this means our card may execute 128 32-bit multiply-adds per SM
× 28 SMs = 3584 multiply-adds per clock. For reference, the card’s core
clock runs at 1.3-1.5 GHz.
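These throughput figures can be turned into a rough upper bound with a few lines of arithmetic. The sketch below uses the numbers quoted in the text; the audio sampling rate of 44.1 kHz is an assumption chosen for illustration, and the result ignores memory traffic, kernel launch overhead and scheduling limits, so it is a ceiling rather than an expected value.

```python
WARPS_PER_SM = 64
THREADS_PER_WARP = 32
MADS_PER_CLOCK_PER_SM = 128      # 32-bit multiply-adds, consumer card
NUM_SMS = 28
AUDIO_RATE = 44100               # Hz, assumed for illustration

threads_in_flight_per_sm = WARPS_PER_SM * THREADS_PER_WARP
mads_per_clock = MADS_PER_CLOCK_PER_SM * NUM_SMS

print(f"threads in flight per SM: {threads_in_flight_per_sm}")
print(f"multiply-adds per clock:  {mads_per_clock}")

for clock_ghz in (1.3, 1.5):
    mads_per_second = mads_per_clock * clock_ghz * 1e9
    mads_per_sample = mads_per_second / AUDIO_RATE
    print(f"{clock_ghz} GHz: about {mads_per_sample / 1e6:.0f} million "
          f"multiply-adds per audio sample (theoretical ceiling)")
```

Dividing the per-sample ceiling by the number of multiply-adds one filter update needs gives a first estimate of how many filters could run in parallel at best.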

Thus far we’ve considered parallelism and availability of arithmetic
units; we must also keep the memory hierarchy in mind.

Each SM has some number of registers. These are very fast. The 
compiler will attempt to use them for thread-local variables. On our 
card, there are 64,000 32-bit registers per SM. 

Each thread block (a user-defined organizational unit of threads,
composed of one or more warps) has some fast “shared” memory. This is
64KB per SM for our setup. We’ll return to this later as an
optimization.

All threads may access a card-wide pool of read-only constant 
memory. 

All threads may also access a pool of global memory, 11GB on our card.
However, this is described as having roughly 100x the latency of shared
memory or registers.

If a thread’s local data will not fit in registers, the compiler may 
reduce parallelism or spill to “local memory.” This technically lives 
in the slow global device memory pool, but is backed by a cache. 

1.4. Development Approach 

We take an iterative development approach, getting a basic algorithm 
working and then proceeding to tune it in stages. The CUDA toolkit 
contains IDE plugins and debugging tools, making it straightforward 
to analyze bottlenecks as we encounter them. The compiler will also 
be helping us along the way. 

To set expectations, we know there will be overhead involved in 
transferring data between CPU and GPU, overheads in starting and 
stopping our kernel, and overhead introduced by the host operating 
system. We try to mitigate some of these, but some are unavoidable. 

It is also important to note the significant effort that would be 
involved in moving from this proof of concept to a commercial DAW 
plugin. A hypothetical DAW is competing for CPU resources, will 
be using the GPU to render its GUI (our kernels can run alongside 
that with no issue, but there's still potential resource competition), 
and will force our choice of buffer size and latency. 

1.5. Test Setup 

The test setup consists of: 

• GPU: An NVIDIA GeForce GTX 1080Ti, which is a consumer-grade
graphics card, though a relatively high-level one.

• CPU: An Intel i5 3570K running at stock speed. We note this CPU
is six generations old and a mid-level chip even in its generation,
and newer CPUs may include newer vector instructions including
AVX-512. However, it is unlikely to bottleneck us, as it is used
primarily for memory transfer and GPU kernel launches.

• RAM: CPU has 16GB, GPU has 11GB; neither will bottleneck us in
these synthetic benchmarks.

• Storage: consumer SATA SSDs that will not be a bottleneck,
especially since our tests should reside completely in RAM.

• OS and software: Development was cross-platform; kernels were
written on Ubuntu Linux with Microsoft’s open-source VSCode as a
text editor and compiled using the CUDA Toolkit. During the memory
optimization phase of the project, NVIDIA Nsight Visual Studio
Edition on Windows was used for its “Next-Gen CUDA Debugger,”
though it is noted that the Linux/Mac Eclipse edition also contains
an Eclipse-based profiler.

• Programs were compiled as 64-bit in case we use more than 4GB of
RAM, possible with high buffer sizes and high numbers of parallel
waveguides.

We discuss development of two algorithms: high-Q filters suitable for
use in modal processors, and a simplified form of digital waveguides,
running independently without scattering junctions and with only a gain
multiplier in the feedback loop. These two systems were developed
simultaneously and do not depend on each other; we begin with the modal
filter code since it is simpler, can essentially ignore the GPU memory
hierarchy (everything besides output data fits in registers), and we
estimate will be bottlenecked exclusively by the floating-point
throughput of the graphics card, which makes it the easier of the two to
optimize.
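To make the simplified waveguide concrete, the following Python sketch models a bidirectional delay line with two traveling waves and a single gain multiplier in the loop. It is an illustrative CPU model with assumed details (where the excitation is injected, where the output is tapped, and where the gain is applied) and does not reflect the GPU kernels; the function name `simple_waveguide` is hypothetical.

```python
from collections import deque

def simple_waveguide(excitation, delay_samples, gain, num_samples):
    """Lossy loop of two delay lines (right- and left-going waves).

    delay_samples must be at least 1. The excitation is added at the
    left end, the output is tapped at the right end, and the loop gain
    is applied once per round trip at the right-end reflection.
    """
    right = deque([0.0] * delay_samples)   # right-going traveling wave
    left = deque([0.0] * delay_samples)    # left-going traveling wave
    out = []
    for k in range(num_samples):
        x = excitation[k] if k < len(excitation) else 0.0
        end_right = right.pop()            # sample arriving at the right end
        end_left = left.pop()              # sample arriving at the left end
        left.appendleft(gain * end_right)  # reflect with loss
        right.appendleft(end_left + x)     # reflect and add excitation
        out.append(end_right)
    return out

# Impulse excitation, 4-sample delay lines, loop gain 0.5.
response = simple_waveguide([1.0], delay_samples=4, gain=0.5, num_samples=24)
```

With these settings the impulse first reaches the output after 4 samples and then returns every 8 samples, scaled by the loop gain on each round trip.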

2. MASSIVE MODAL FILTER BANK 

As described above, a modal filter bank used for synthesis, effects or 
reverberation consists of N resonant filters. We make the assumption 
that all the filters are uniform in construction and vary in parameters; 
a GPU can of course run multiple styles of filters in parallel, either 
through conditional execution or simultaneous kernel execution. 

In practice, rapidly changing the coefficients on e.g. Direct-Form II
filters may result in audible artifacts. In [12] Max Mathews
and Julius Smith proposed a filter that is very-high-Q, numerically
stable, and artifact-free, based on properties of complex
multiplication.

This is suitable for modal synthesis and reverberators such as in
[2]; the recursive update equation we need to implement is:

$$y_m(t) = \gamma_m x(t) + e^{(j\omega_m - \alpha_m)}\, y_m(t-1) \qquad (1)$$

where:

$x(t)$ is an input or excitation signal.

$\omega_m$ is the frequency of mode $m$.

$\gamma_m$ is a per-mode complex input amplitude gain.

$\alpha_m$ is a per-mode damping factor.

This is straightforward to implement; the state we store for each
mode is limited to the prior output $y_m(t-1)$, the parameters
$\alpha_m$, $\gamma_m$, and $\omega_m$, even if only for intermediate
computation. For simplicity we keep them all; noting that while complex
values use two 32-bit registers each (four when using 64-bit precision),
we likely have 255 registers per thread and have room to spare.
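As a minimal sketch of the per-mode update described above, the following Python (used here instead of CUDA purely for readability) runs one mode sample by sample. The function name `phasor_filter` and the choice of radians per sample for the frequency are assumptions made for illustration; the benchmarked kernel is not reproduced here.

```python
import cmath

def phasor_filter(x, omega, alpha, gamma=1 + 0j):
    """Run one mode: y[t] = gamma * x[t] + exp(j*omega - alpha) * y[t-1].

    omega is assumed to be in radians per sample, alpha in nepers per sample.
    """
    pole = cmath.exp(complex(-alpha, omega))  # fixed while parameters are static
    y_prev = 0j                               # the only state carried between samples
    out = []
    for sample in x:
        y_prev = gamma * sample + pole * y_prev
        out.append(y_prev)
    return out

# Excite one mode with a unit impulse and let it ring.
impulse = [1.0] + [0.0] * 99
undamped = phasor_filter(impulse, omega=0.1, alpha=0.0)
damped = phasor_filter(impulse, omega=0.1, alpha=0.01)
```

With zero damping the output magnitude stays at one, which is the free-running oscillator case; with positive damping the magnitude decays exponentially.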

We benchmark three approaches.

When letting these resonating filters run as undamped oscillators,
we are able to compute and reuse the complex exponential
value, and only conditionally add the input term; with these
simplifications we will require two floating-point multiplies per cycle.
We create a benchmark to determine the number of such oscillators
we can run in parallel in realtime. We run two variations of this
benchmark at different buffer sizes. A third benchmark simulates a
performance that modulates all the filters on every sample: we
recompute the exponential term each time it is used, and look at the
performance impact.

In more detail:

Free-Run is the optimal case where the oscillators only need to 
update based on a complex multiplication of y(t — 1) with a static 
value of the complex exponential. A buffer size of 2,000 samples 
is likely larger than we’d want for realtime performance (45ms at 
44.1kHz), but allows us to reduce kernel-switch overhead. 

Small Buffer is identical to Free-Run, but with a buffer of 256 
samples (5.8ms at 44.1kHz). 

Continuous Modulation is our third approach, simulating gain
parameters and frequencies changing continuously, requiring
recomputing the complex exponential term with each sample update, in
addition to performing the 2-multiply complex update of the filter state.
This case uses the same 256-sample buffer as Small Buffer.

We measure the amount of time it takes to render ten seconds
of 44.1kHz audio for N phasor filters in parallel. This means that
benchmark runtimes over 10 seconds fall behind realtime
performance, while values under 10 might be feasible. For each trial, the
median of three runs was used; in practice we did not see large
outliers in these tests.

Tabulated results are in Table 2; bold entries took less than ten
seconds to compute and thus are candidates for realtime performance.
In practice, we might want to avoid values under but close to ten
seconds, due to system variance and unmeasured overhead of a DAW,
OSC server, controller processing, etc. The same data is available as
a plot in Figure 1, with a horizontal line representing realtime limits.
In all graphs in this paper, lines between sample counts are present
only to show trends, and we do not expect results for intermediate
values of N to fall precisely on that line.

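Table 2: Time (in seconds) to run N filters for 10 seconds of audio

| N Filters | Free-Run | Small Buffer | Continuous Mod. |
|-----------|----------|--------------|-----------------|
| 458752    | **1.48** | **2.95**     | **3.97**        |
| 917504    | **2.63** | **4.18**     | **5.54**        |
| 1835008   | **4.85** | **7.17**     | **8.49**        |
| 3670016   | **9.23** | 11.39        | 13.21           |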

Some observations: 

As these filters are completely independent, we achieve high
utilization on the GPU and are only blocked on availability of
floating-point units. All data is stored in registers and we avoid memory
accesses, especially global memory accesses.

It is worth reiterating that this is benchmarking building blocks.
We synthesize audio and copy it back to the host RAM, but additional
logic is needed on the CPU to modulate parameters based
on realtime user input or performance data and most likely to
post-process the output with effects.

Using a smaller buffer incurs more cost, which can be 50% or more,
percentage-wise, for low N (roughly doubling the runtime at the lowest
N tested). At very high N the effect is lower; we bottleneck on
floating-point unit availability in the large-buffer version, but have
lower kernel launch overhead.
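The size of this penalty can be checked directly from the Free-Run and Small Buffer columns of Table 2. The short Python sketch below only restates those published figures; the helper name `small_buffer_overhead` is illustrative.

```python
# Runtimes in seconds copied from Table 2: (Free-Run, Small Buffer).
table2 = {
    458752: (1.48, 2.95),
    917504: (2.63, 4.18),
    1835008: (4.85, 7.17),
    3670016: (9.23, 11.39),
}

def small_buffer_overhead(free_run, small_buffer):
    """Extra runtime of the 256-sample buffer relative to the
    2,000-sample buffer, in percent."""
    return 100.0 * (small_buffer - free_run) / free_run

overhead = {n: small_buffer_overhead(*times) for n, times in table2.items()}
for n, pct in overhead.items():
    print(f"N={n}: {pct:.0f}% slower with the small buffer")
```

The relative overhead falls steadily as N grows, from roughly a doubling at the smallest N to under a quarter at the largest.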

As a final observation on the data in Table 2, the continuously-modulated
version does not suffer as large a performance penalty as expected,
since it appears we had some idle 32-bit floating-point units (they
are not occupied every cycle). It also allows us to eliminate a
conditional check since we always run that logic.
Figure 1: Time to run N filters for ten seconds of samples under
different trials.


Moving forward, we benchmark the use of double-precision arithmetic.
We made an alternate 64-bit kernel, basically swapping
cuDoubleComplex in for the default cuComplex, which is by
default typedefed to be single-precision.

With a 256-sample buffer and continuously-changing parameters,
and N=458,752 filters, it takes 19.49 seconds to render 10 seconds
of audio. Our corresponding single-precision trial only took
3.97 seconds, so we note a 4.9x slowdown. As noted earlier, each
SM on our GPU may only issue four 64-bit floating point multiply-adds
versus 128 32-bit adds. As we did not achieve 100% utilization
of the floating point units in prior benchmarks, we don’t necessarily
suffer a 32x (128/4) slowdown, but it is clear we are being
bottlenecked by double-precision FPU availability with this configuration.

As we might expect, scaling down to 114,688 filters lowers resource
contention enough to run within our realtime constraints (7.04
seconds to synthesize 10 seconds of audio). If we need the extra
precision, that is likely still more than enough high-Q filters to enable
some interesting instruments and effects, such as creating a virtual
drum set with several thousand filters available to each instrument.

3. MASSIVE WAVEGUIDE “ORCHESTRA” 

Next, we code up a kernel that performs the computations for a simple
1-D Digital Waveguide. This follows the description of the structure
from Section 1.1; each thread owns a bidirectional delay line
made up of contiguous memory on its thread stack. This is used as a
circular buffer, with an index value serving as a read/write head, and
a multiplicative factor on feedback introduces dispersion loss. Note
that for most waveguide-based physical models, additional code will
be needed for scattering junctions, nonlinearities, etc., reducing our
maximum throughput and complicating our kernel code, but parallel
waveguides may be benchmarked as a starting point, to suggest an
upper bound for performance.
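To make the structure concrete, here is a minimal Python sketch of the simplified waveguide described above: a circular buffer, one index acting as the read/write head, and a gain in the feedback loop. The function name `run_waveguide` and the way the excitation is injected are assumptions for illustration, not the benchmarked kernel.

```python
def run_waveguide(excitation, length, feedback_gain, num_samples):
    """Simplified waveguide: a circular delay line with a gain in the feedback loop."""
    delay_line = [0.0] * length   # contiguous state owned by one "thread"
    index = 0                     # single read/write head
    out = []
    for t in range(num_samples):
        x = excitation[t] if t < len(excitation) else 0.0
        delayed = delay_line[index]        # sample written `length` steps ago
        y = x + feedback_gain * delayed    # loss applied on each round trip
        delay_line[index] = y
        out.append(y)
        index += 1
        if index >= length:                # wrap the circular buffer
            index = 0
    return out

# An impulse should recur every `length` samples, scaled by the gain each time.
echo = run_waveguide([1.0], length=5, feedback_gain=0.5, num_samples=12)
```

In the GPU setting each thread would run this loop over its own delay line for one buffer of samples per kernel launch.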

As a baseline, we start with uniform waveguides of delay-length
M=5000 samples in total², and process audio in 2000-sample chunks
with each launch of the GPU kernel.

² Lengths in this benchmark represent the total length of the delay in the
system; if building a waveguide from a bidirectional delay line, each delay

Because a buffer size of 2000 at 44.1kHz would be 45 milliseconds
(longer than we'd like for interactive applications), we performed the
same trial at 256 samples (5.8ms).

Then, because we expect to run out of thread registers and spill
to expensive “local” memory (as above, really in the global pool), we
run a second variant with waveguides of length 10 samples and a buffer
of 2000 samples. This is arguably too much of a simplification, but it
may be useful to establish a loose upper bound on performance.

In each trial, we compute the amount of time it takes to repeatedly
run the kernel on the GPU and copy some data back out to the
host.

The data copy is non-negligible overhead; assuming 1.8M waveguides
and a buffer of length 2000, we generate 13GB of audio data
with each kernel execution. As such, we first sum all the samples in each
warp to reduce the overhead by a factor of 32, still leaving us with a
substantial amount of sound data to transfer across the PCI Express
bus.
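The data volume quoted above can be reproduced with a few lines of arithmetic. This sketch assumes 32-bit float samples (the sample width is not stated at this point in the text) and uses the largest N from the benchmarks as the "1.8M" figure.

```python
BYTES_PER_SAMPLE = 4   # assumption: 32-bit float samples
WARP_SIZE = 32

def output_bytes(num_waveguides, buffer_size, sum_per_warp=False):
    """Bytes produced by one kernel launch, optionally after per-warp summing."""
    streams = num_waveguides // WARP_SIZE if sum_per_warp else num_waveguides
    return streams * buffer_size * BYTES_PER_SAMPLE

n = 1835008  # largest N in the benchmarks, roughly 1.8M
raw = output_bytes(n, 2000)
summed = output_bytes(n, 2000, sum_per_warp=True)
print(f"raw: {raw / 2**30:.1f} GiB, per-warp sums: {summed / 2**20:.1f} MiB")
```

Even after the 32x reduction, each launch still leaves several hundred megabytes to move across the bus.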

As in the high-Q filter benchmarks, we build N independent 
objects in parallel and measure the time it takes to synthesize ten 
seconds of sound at 44.1kHz. 

Results are in Table 3. N is a multiple of 32 to ensure all warps
are occupied. Bold entries take less than ten seconds to compute and
thus ran faster than realtime. A plot of the data is in Figure 2.


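Table 3: Time (in seconds) to generate 10s of Audio, Uniform Waveguides

| N DWGs  | Baseline  | Small Buffer | Short Waveguide |
|---------|-----------|--------------|-----------------|
| 3584    | **0.249** | **0.544**    | **0.11**        |
| 14336   | **0.522** | **0.811**    | **0.272**       |
| 57344   | **1.44**  | **1.75**     | **0.95**        |
| 114688  | **2.79**  | **3.09**     | **1.83**        |
| 229376  | **5.5**   | **5.74**     | **3.68**        |
| 458752  | 10.85     | 11.07        | **7.28**        |
| 917504  | 21.14     | 20.24        | 14.59           |
| 1835008 | 42.611    | 49.72        | 29.152          |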


We note some trends: 

As expected, computing more waveguides requires more time.
Scaling is sub-linear while growing at small N as we utilize more
of the GPU in the parallel section of the benchmark (“for free”), but
we still incur a cost for memory transfer of the outputs off the card,
which itself scales linearly with N. The parallel sound synthesis
portion of the program becomes linear with N as resources are
exhausted; beyond this point we essentially are cycling through groups
of warps serially.

Decreasing the buffer size from processing 45ms to 6ms of audio 
per kernel execution did not seem to affect the feasible N as much as 
anticipated. There is a notable 2x difference at small N but for both, 
the 458,000 waveguides trial was not feasible while the 230,000 
waveguide trial used approximately 55% of the available time slice. 

A variation of the trial using a shorter waveguide showed that
through the range of our trial values of N, scaling is partially
dependent on memory usage. As noted above, this is an experiment
performed to validate that, as we might expect, longer-length delay
lines may incur more computational cost. Of course, the delay line
lengths used in practice will be defined by our physical model and
sampling rate.
Figure 2: Time to run N waveguides for ten seconds of samples under
different trials.


These preliminary benchmarks suggest a rough upper bound, so
in our iterative development approach we return to coding, refining
our kernel and un-relaxing some assumptions. Currently, all the
waveguides have the same delay line length; this is unrealistic for
real-world applications, so we next move to have waveguides play
one of 1000 different pitches, by having each parallel digital waveguide
own a delay line of different length. We use our baseline setup,
and allocate the same blocks of memory as before, but have waveguide
n be of length 128 + 5n mod 5000. This means that all waveguides
inside a warp will have different delay line lengths, and warps
compute different values overall.

Results are in Table 4. 

Table 4: Time to generate 10s of Audio. “Baseline” uses same-length
waveguides, “Differing” experiment uses heterogeneous waveguides.
Slowdown Factor is the multiplicative performance penalty.

| N DWGs | Baseline (s) | Differ. Lengths (s) | Slowdown Factor |
|--------|--------------|---------------------|-----------------|
| 3584   | 0.249        | 2.0                 | 8.03x           |
| 14336  | 0.522        | 5.03                | 9.63x           |
| 57344  | 1.44         | 19.72               | 13.69x          |
| 114688 | 2.79         | 39.41               | 14.12x          |
| 229376 | 5.5          | 78.41               | 14.26x          |
| 458752 | 10.85        | 156.08              | 14.38x          |


This is not ideal; we see a slowdown factor of 14x in our
highly-parallel cases and went from supporting computation of 450
thousand simultaneous commuted waveguides to only 28 thousand. What
changed? Two initial ideas come to mind:

Increased branch divergence. Some advice when writing GPU
kernels is to avoid branch divergence, where a portion of the threads
in a warp take one path of an if() statement but others take the
else. This is because GPU threads do not have independent
branching logic: the SIMT approach means that all threads execute
each instruction in lockstep. In the case of an if() statement,
threads evaluate the conditional and vote; if they are not unanimous,
then both branches are executed in serial and threads ignore
execution during the branch they did not take.

Looking at our code, the piece that uses branching is the check for
whether to wrap around in a delay line’s circular buffer:

if (bufferIndex >= waveguideLength) {
    bufferIndex = 0;
}


The overhead of running both branches is minimal: we incur
an extra instruction of setting a register to zero more often (1 cycle)
and the else branch is a no-op. In addition, GPUs have support for
predicated instructions for short branches, which means this case is
compiled to the non-branching code:

cond = bufferIndex >= waveguideLength
cond? bufferIndex = 0

This may be validated by looking at generated PTX pseudoassembly
code, or using profiling tools to annotate branch divergence for
each line of our source code after a test run.
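The equivalence between the branching wrap and a predicated form can be illustrated in a few lines of Python. This is only an analogy under stated assumptions: the function names are hypothetical, and the arithmetic select stands in for what a predicated instruction does in hardware.

```python
def advance_branching(index, length):
    """Advance the head and wrap with an explicit branch, as in the kernel source."""
    index += 1
    if index >= length:
        index = 0
    return index

def advance_predicated(index, length):
    """Same update with no control flow: compute the condition,
    then use it as a mask so every "thread" runs identical instructions."""
    index += 1
    cond = index >= length
    return index * (1 - int(cond))   # cond true -> 0, cond false -> index

# Both forms agree for every head position of a short delay line.
length = 10
agree = all(
    advance_branching(i, length) == advance_predicated(i, length)
    for i in range(length)
)
```

Because both forms produce the same result, threads in a warp that disagree on the condition lose at most the cost of the masked assignment.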

The second thought of why we see slowdown when introducing 
heterogeneous delay line lengths is memory access patterns. 

Our block of memory for waveguide state was defined to be
of size WARPSIZE*NWAVEGUIDES by BUFFERSIZE rows. This
means the N waveguides write to memory locations 0..N-1 on the
first sample, N..2N-1 on the second sample, etc.

Global memory access in CUDA is slow, but reads and writes
may be coalesced; that is, if all threads in a warp are accessing
data in the same aligned 128-byte block, only one to four line reads
will need to occur (this is card-dependent). Newer cards have better
caches, compiler optimizations, and runtime logic for global memory
placement, but this is still worth considering.

In our case, consider we have 32 waveguides in a warp; these are
of lengths 5000..5031. During the first “trip” through the waveguide’s
circular buffer (first 5000 samples), memory is aligned as all
waveguides index to the same offsets. Over the next several cycles
through the waveguide, some waveguides will cycle earlier than others
and eventually we will reach a state where we require simultaneous
memory reads to 32 different lines, so slowdown will result.
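The drift described above can be simulated on the host. The Python sketch below counts how many distinct 128-byte lines one warp touches at a given sample step, assuming 32-bit samples and the row-per-sample layout described earlier (sample index selects the row, waveguide number selects the column). The function name and the choice of 3584 total waveguides are illustrative assumptions.

```python
WARP_SIZE = 32
LINE_BYTES = 128       # aligned block size relevant to coalescing
BYTES_PER_SAMPLE = 4   # assumption: 32-bit float samples

def lines_touched(step, lengths, num_waveguides):
    """Count distinct 128-byte lines one warp touches at a given sample step."""
    lines = set()
    for w, length in enumerate(lengths):
        index = step % length                  # position in the circular buffer
        word = index * num_waveguides + w      # row `index`, column `w`
        lines.add(word * BYTES_PER_SAMPLE // LINE_BYTES)
    return len(lines)

lengths = [5000 + w for w in range(WARP_SIZE)]   # lengths 5000..5031
first_trip = lines_touched(100, lengths, num_waveguides=3584)
after_wrap = lines_touched(5031, lengths, num_waveguides=3584)
```

During the first trip all 32 threads share one line; once every waveguide has wrapped at a different moment, each thread sits on its own row and the warp needs 32 separate lines.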

Such memory accesses are cached, but with high numbers of
waveguides we could easily evict old entries quickly. We open the
CUDA Analysis tools, profile memory access, and find that this is
indeed the case; Figure 3 shows lots of global memory accesses with
only 2% hitting the L1 cache:

Figure 3: CUDA memory profiler results. L1 cache hit rate is 2.4%


To work past this slowdown, we propose two ideas: 


3.0.1. Synchronize on Cycle Point 

We could determine the longest waveguide in a block and ensure all
waveguides’ circular buffers loop at the same moment. For example,
if we have guides of length 250 and 255, the former avoids writing
to memory until our indexing counter loops back to index 1 of the
array. The tradeoff is that we need to run more overall cycles in order
to completely fill the output buffer from the shorter waveguide³.

In a degenerate case, what if we have waveguides of length 5000 
and 100? Only 2% of cycles are spent actually generating audio for 
the shorter waveguide with a naive approach. 
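The cost of forcing a common cycle point is simple to quantify. In this hedged Python sketch, `useful_fraction` is a hypothetical helper giving the share of cycles in which a waveguide produces audio when every delay line in the block wraps at the longest length.

```python
def useful_fraction(length, longest_in_block):
    """Fraction of cycles a waveguide does real work when all loops are
    forced to wrap together at the longest length in the block."""
    return length / longest_in_block

near_match = useful_fraction(250, 255)    # guides of similar length
degenerate = useful_fraction(100, 5000)   # the degenerate case from the text
```

Similar lengths waste only a few percent of cycles, while the degenerate pairing leaves the shorter waveguide idle almost all of the time.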

We could sort all waveguides by length globally, so that similarly-sized
waveguides are in the same warp, to minimize the number of
extra iterations. However, this makes it much slower to later couple
specific waveguides together, which we aim to do in a project that
leverages this acceleration.

We could consider a middle ground where we have multiple eligible
cycle points. Perhaps every 32 or 64 calculated samples, we
could reset and unblock waveguides that are currently idle. This
introduces a tradeoff between the number of simultaneous memory
accesses and the number of cycles to compute at the end. On a positive
note, coupling a busy memory controller with the “underpowered”
64-bit floating-point unit on consumer cards would
hide some of the drawbacks of each.

In the end we did not pursue this approach in depth; a second 
approach was more promising and produces more readable code: 

3.0.2. Shared memory 

Shared memory is a type of memory that belongs to a thread block 
and has much lower latency than global memory, but is only readable 
by threads within the block that owns it. This sounds great for our 
current use case and bottleneck. With our card’s hardware, we have 
96KB of shared memory available to each thread block. This means 
that we could choose, for example, two warps per block and have 
1.5KB of RAM per thread, or 384 32-bit samples per thread. 

We may wish to have longer waveguides than 384 samples, so 
we propose two workarounds: 

• Lower utilization: Simply split the 1.5KB up among fewer 
threads, and use for example 24 out of 32 threads in a warp. 
While our overall utilization will be lower, the faster memory 
might save us enough time overall to run multiple copies of 
our work serially to increase N globally. 

• Mixed-size waveguide groups: We could split the available 
shared memory such that, for example, the two shortest-length 
waveguides in a warp each donate half their buffer to the 
longest waveguide. 
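The per-thread budget and the lower-utilization workaround above come down to one division. The Python sketch below uses the 96KB figure quoted in the text; the helper name and the assumption of 4-byte samples are illustrative.

```python
def samples_per_thread(shared_bytes, active_threads, bytes_per_sample=4):
    """Delay-line samples each active thread can keep in shared memory."""
    return shared_bytes // (active_threads * bytes_per_sample)

SHARED_BYTES = 96 * 1024                               # figure quoted in the text
two_warps = samples_per_thread(SHARED_BYTES, 2 * 32)   # 64 threads per block
partial = samples_per_thread(SHARED_BYTES, 2 * 24)     # 24 of 32 threads per warp
```

Dropping to 24 active threads per warp buys a third more delay per waveguide at the cost of a quarter of the compute lanes.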

While the second approach may still seem problematic from a
memory access point of view, the rules for shared memory access
optimization are different than those for global memory access
optimization. Shared memory on our GPU’s architecture is grouped into
32 banks, and as long as two threads do not access the same bank
at the same time (a bank conflict), we obtain full-speed access. On
our card, access is actually done in two successive stages, each
obtaining results for a half-warp, so the “donation” approach should be
safe from slowdown as long as we can arrange memory accesses in
time to have no bank conflicts. With simple donation schemes this is
straightforward.
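A small host-side model shows why layout matters for bank conflicts. This Python sketch assumes successive 32-bit words map to successive banks and that all addresses in a request are distinct; the interleaved layout is a hypothetical example, not necessarily the layout used in the benchmarked kernel.

```python
from collections import Counter

NUM_BANKS = 32

def worst_bank_conflict(word_addresses):
    """Largest number of simultaneous accesses landing in one bank
    (1 means conflict-free). Assumes distinct addresses and that
    successive 32-bit words map to successive banks."""
    banks = Counter(addr % NUM_BANKS for addr in word_addresses)
    return max(banks.values())

def interleaved_address(thread, index):
    """Hypothetical layout: sample `index` of thread `thread` lives at
    word index * 32 + thread, so each thread stays in its own bank."""
    return index * NUM_BANKS + thread

# Threads at unrelated buffer positions (heterogeneous lengths) do not collide.
indices = [(7 * t) % 100 for t in range(32)]
interleaved = worst_bank_conflict(
    interleaved_address(t, i) for t, i in enumerate(indices)
)
# Contrast: contiguous 384-word slabs put every slab start in bank 0.
contiguous = worst_bank_conflict(t * 384 for t in range(32))
```

With the interleaved layout the bank depends only on the thread number, so differing delay-line lengths no longer cause serialization.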

³ We also note writes to that output buffer, previously perfectly aligned,
are now unaligned themselves.
As a middle ground, we can sacrifice some compute utilization
for larger buffer donations. Consider having each warp provide a
developer a predefined “care package” of: (a) One “Extra Large”
digital waveguide that occupies 4 banks of shared memory (length
configurable, up to 1536 samples of delay), (b) Two “Large” waveguides
owning 2 banks each (up to 768 samples each), and (c) 26
normal waveguides (up to 384 samples each).

This still lets us compute 28 waveguides per warp vs. the 32
we had before (87.5% effective utilization) but allows for lower
frequency extension. We note that with such approaches we are becoming
opinionated concerning the basic system building blocks; when
we do this certain applications are enabled but we may block other
applications.

We adjust our kernel to use shared memory. Due to the resource
configuration of the graphics card and the dimensions of the problem,
we would use more shared memory than is available in each SM, so
we must move from using 64 threads (2 warps) per block down to 32
threads (1 warp), at which point we are under the shared-memory-per-SM
limit and our kernel can be scheduled. This serves as a reminder
that GPU hardware is not as abstracted as we may be used
to when coding for a CPU. In this particular case though, the “nvcc”
compiler helpfully caught this at compile-time since it was an
oversized static allocation, making for an easy fix.

We also include a quick performance gain of pinning memory on 
the host, accomplished by simply swapping malloc with the API 
call cudaMallocHost. 

Results are in Table 5. As before, bolded entries are feasibly 
realtime. A plot of the same data is in Figure 4. 

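Table 5: Time (in seconds) to generate 10 seconds of Audio. “Baseline”
uses same-length waveguides, “Differing” experiments use heterogeneous
waveguides with either global or shared memory.

| N DWGs | Baseline  | Differing: global mem. | Differing: shared mem. |
|--------|-----------|------------------------|------------------------|
| 3584   | **0.249** | **2.0**                | **0.58**               |
| 14336  | **0.522** | **5.03**               | **0.96**               |
| 57344  | **1.44**  | 19.72                  | **1.82**               |
| 114688 | **2.79**  | 39.41                  | **3.50**               |
| 229376 | **5.5**   | 78.41                  | **6.79**               |
| 458752 | 10.85     | 156.08                 | 12.59                  |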


To summarize: using shared memory allows us to make
higher-waveguide counts tractable again. We can still run over a hundred
thousand independently-sized waveguides with half of our cycles to
spare for extending the algorithm.

At this point we have enough waveguides that we can spend
some time thinking of creative applications for them. Those applications
will certainly make them computationally more expensive
by adding coupling, nonlinearities, modifiable tap points,
fractional-length delays, etc.

4. AREAS FOR DEVELOPMENT 

We stop here, but note there may still be room for acceleration. For
example, relatively new GPUs including ours have the ability to overlap
kernel executions with host/device memory transfers. If we were to
double-buffer on the host and device, we could work on one array while
the other transfers, and vice versa. This would help us especially at
small N or if we wanted to copy hundreds of thousands of individual
audio streams back to the host (skipping our current merge step
where we sum them per-warp).
Figure 4: Time to run N heterogeneous waveguides for ten seconds
of samples under different trials.


We have been discussing the general-purpose case of supplying
waveguides of preconfigured lengths. With a pre-specified configuration,
we could write tooling to efficiently group waveguide computations
into a warp for maximum resource utilization.

In the modal filter bank, we note that we do not implement
phase-correct input re-excitation which is a nice feature supported
by these filters: we have logic to track zero-crossings but do not
implement parameter updates from the host in a fashion that a “real”
system would use. This is a simple and low-cost feature. Furthermore,
it is likely that either CPU or GPU should interpolate parameters,
which is work that is not being accounted for.

The high performance of these oscillators bodes well if we were
to implement a massive collection of digital waveguide oscillators,
another case requiring only a few variables and limited multiplies
per cycle. It may be worth looking at algorithms that traditionally
did well on VLSI architectures for use here, as the concept of many
parallel independent instances of a module executing concurrently
but varying on input data is shared between the two architectures.

5. CONCLUSIONS 

We showed modern consumer GPUs may run high-Q phasor filters
and 1D digital waveguides without needing to leverage parallelism
across time. In particular, we showed that it is feasible to build a bank
of several hundred thousand 1D waveguides, a hundred thousand 64-bit
phasor filters with stable per-sample adjustments, or a few million
phasor filters at 32-bit precision, all at 44.1kHz.

Reflecting on the overall optimization development and debugging
strategy used here: care must be taken to have the right
number of warps grouped together into grids and blocks. Memory
accesses should go to the fastest RAM possible, and we need to pay
attention to memory alignment. While CPU code does benefit from
similar optimizations, GPU algorithms rapidly fall in performance
when parameters stray from the ideal range.

One other note around generalizing this code for end users: during
development we consulted the capabilities of our particular graphics
card several times, to see how many registers we have or to see the
various ways we can slice shared memory. While these parameters
may be queried from the card at runtime and for the most part newer
and more powerful GPUs contain a superset of old resources, this
is not a guarantee; for example, if we tried to run our compiled
waveguide binary on a GTX 480 from several generations back, it
would fail to run because we request too much shared memory.

Still, optimization of these algorithms can be seen as an interesting
puzzle; profiling tools make it easy to see where bottlenecks live
(if not how to work around them), and it’s fun to transpose an array
or adjust memory layout and unlock a 10x speedup.

From a sound designer’s point of view, being able to use so many 
of these building blocks at audio rates may allow for higher-fidelity 
physical models and modal effects, using commodity hardware that 
often sits idle while working with audio software. 
ABSTRACT

In this paper we present an assessment of the computational
performance regarding the use of the AM/FM decomposition framework
for the design and implementation of audio effects. The equations
and intuitions are reviewed and audio examples are provided,
alongside Csound code for real-time implementation. Two types of
hardware and several computer music techniques were considered for
the comparisons. We also introduce sqENVerb, a novel inexpensive
reverb-enhancer effect.

1. INTRODUCTION 

Following studies in areas like modulation vocoder [1] [2] [3] and
modulation filtering [4] [5] [6], in our previous studies [7] [8] the
non-coherent mono-component AM/FM paradigm was presented as
a framework for the development of new audio effects. The theory
was thoroughly revised and treated in [9]; however, the computational
effort required to run different types of effects was not addressed.

In this paper we present an assessment of the performance
considering different computational systems and different audio
processing techniques. Two kinds of computers were used, namely a
RaspberryPi model 2B and a Lenovo ThinkPad x220. The former
was chosen because it represents the category of low cost programming
platforms that can be used, among other applications, for audio
processing; the latter represents a more powerful and relatively
popular computational system. Netbooks and old laptops might loosely
fall in a category between these two examples. Notice also that many
programming platforms similar to the Pi actually outperform it, in
the same way that many computers assembled for gaming purposes
outperform the ThinkPad. So the assessment presented here represents
a somewhat conservative scenario; anything running satisfactorily
on the Pi and ThinkPad should also run on these more powerful
computers.

Beyond the CPU consumption, while our previous papers emphasised
manipulations on the instantaneous frequency component
of the AM/FM decomposition, now we also address an effect obtained
by manipulating the envelope of the signal.

In Section 2 we will briefly review the AM/FM Hilbert-based
framework and code for real-time implementation. In Section 3 a
new reverb-like effect is introduced and evaluated with a brief
objective assessment based on audio descriptors. Then we proceed in
Section 4 to a presentation and discussion of the required computational
power in order to run the AM/FM framework and effects. Finally,
we conclude and point out our current and future work. Audio examples
will be referenced in the paper with the symbol [►filename] and are
available alongside Csound code for download¹.

2. THE AM/FM FRAMEWORK 

The AM/FM decomposition unravels a signal x(t) to a pair of
components: an envelope a(t) and an instantaneous frequency signal
f(t). Together these signals can modulate a sinusoid both in
amplitude and frequency in order to obtain the original signal back, so

$$x(t) = a(t) \cos\left(2\pi \int_{0}^{t} f(\tau)\, d\tau\right). \qquad (1)$$

We can also think of phasors and interpret the argument of the
cosine as an instantaneous phase, which is given by regular increments
(the sum represented by the integral) depending on the instantaneous
frequency. For instance, a regular sinusoid is the projection on the
x-axis of a phasor in which the increments are always the same (tied
to its frequency).
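The phasor interpretation can be sketched in discrete time: at every sample the phase advances by an increment set by the instantaneous frequency, and the output is the envelope times the cosine of the accumulated phase. The Python below is a minimal illustration under that assumption; `am_fm_resynth` is a hypothetical name, and a running sum stands in for the integral.

```python
import math

def am_fm_resynth(envelope, inst_freq, sr):
    """Rebuild x[n] = a[n] * cos(phase[n]), where the phase accumulates
    2*pi*f[n]/sr at every sample (a discrete stand-in for the integral)."""
    phase = 0.0
    out = []
    for a, f in zip(envelope, inst_freq):
        phase += 2.0 * math.pi * f / sr    # phasor increment set by f[n]
        out.append(a * math.cos(phase))
    return out

# Constant envelope and constant frequency give a plain sinusoid.
sr = 8000
x = am_fm_resynth([1.0] * 16, [1000.0] * 16, sr)
```

With a constant 1 kHz frequency at an 8 kHz sample rate the phase advances by a fixed eighth of a turn per sample, which is the "regular sinusoid" case mentioned above.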

In contrast to additive synthesis, where we think globally about 
the signal, the local aspect of the signals in the AM/FM framework 
tracks local dynamics in the envelope case, while the instantaneous 
frequency represents the frequency of a sinusoid that best fits the 
original signal at each instant. 

One of the possibilities for implementing the decomposition is
by means of an analytic signal

$$z(t) = x(t) + i\,\hat{x}(t), \qquad (2)$$

where $i = \sqrt{-1}$ and $\hat{x}(t)$ is the Hilbert Transform of x(t).

The Hilbert Transform shifts all the components in a signal by
90° [10], so it might be implemented by using a set of all-pass
filters, as is done in the hilbert Csound opcode. The important
characteristic of the analytic signal is the absence of the negative
frequencies; its spectrum resembles the original spectrum of x(t) on
the positive frequencies, while the negative components are void, so

$$z(t) = \frac{1}{\pi} \int_{0}^{+\infty} X(\omega)\, e^{i\omega t}\, d\omega, \qquad (3)$$

where $X(\omega)$ is the Fourier Transform of x(t) [11]. In such a way we
can interpret the analytic signal as a superposition of infinitely many
phasors with different frequencies and radii, as shown in Figure 1.

¹ https://www.ime.usp.br/~ag/dl/lacl9.zip



Figure 1: Analytic signal as a superposition of phasors. The original
signal is the projection of the analytic signal onto the real axis.
Source: reproduced from [9]

In Csound the AM/FM decomposition might be implemented 
with the following code: 

opcode Udiff,a,a
  setksmps 1
  asig xin

  /* differentiation */
  asig diff asig
  ksig = downsamp(asig)

  /* phase unwrapping */
  if ksig >= $M_PI then
    asig -= 2*$M_PI
  elseif ksig < -$M_PI then
    asig += 2*$M_PI
  endif

  xout asig
endop

opcode AmFmAna,aa,a
  asig xin

  aim,are hilbert asig            /* xhat and x */
  a_am = sqrt(are^2 + aim^2)      /* envelope */
  aph = taninv2(aim, are)         /* inst. phase */

  /* inst. freq. */
  a_fm = Udiff(aph)*sr/(2*$M_PI)

  xout a_am, a_fm
endop

Notice that the hilbert opcode is used in order to obtain the
analytic signal, and also that the phase needs to be unwrapped. This
opcode works in the time domain using 6th-order recursive filters
to keep signals in quadrature. Alternatively, we could also employ
the hilbert2 opcode, which implements the same process using a
frequency-domain approach implementing a finite impulse response
filter (FIR) using a Fast Fourier Transform (FFT) algorithm. However,
for this paper we have concentrated on using the former method
due to the fact that the FIR approach introduces a latency between
input and output that is proportional to the analysis window, and
therefore it might not be as well suited to hard real-time applications.
In the tests section, we will compare the costs of the time-domain
AM/FM process against the application of FFT analysis-synthesis to
a signal.
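For readers who do not use Csound, the analysis step can be mirrored in a few lines of Python. This is a sketch under stated assumptions: it takes analytic-signal samples as given (the Hilbert transformer itself is not implemented), and the function name `am_fm_analysis` is hypothetical. The unwrapping follows the same differentiate-then-correct logic as the opcode listing above.

```python
import cmath
import math

def am_fm_analysis(analytic, sr):
    """Envelope and instantaneous frequency (Hz) from analytic-signal samples."""
    envelope, inst_freq = [], []
    prev_phase = 0.0
    for z in analytic:
        envelope.append(abs(z))            # sqrt(re^2 + im^2)
        phase = cmath.phase(z)             # wrapped to (-pi, pi]
        delta = phase - prev_phase         # first difference of the phase
        if delta >= math.pi:               # undo jumps across the wrap point
            delta -= 2.0 * math.pi
        elif delta < -math.pi:
            delta += 2.0 * math.pi
        inst_freq.append(delta * sr / (2.0 * math.pi))
        prev_phase = phase
    return envelope, inst_freq

# Synthetic analytic signal: a 440 Hz complex exponential with amplitude 0.5.
sr = 44100
z = [0.5 * cmath.exp(2j * math.pi * 440.0 * n / sr) for n in range(2000)]
env, freq = am_fm_analysis(z, sr)
```

For this test signal the envelope is constant and, after the first sample, the estimated frequency stays at 440 Hz even where the wrapped phase jumps.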

In order to design AM/FM effects we proceed to manipulations
in a(t) and/or f(t) followed by a resynthesis step considering the
modified signals, as represented in the following code:

opcode AmFmRes,a,aa
  a_am_p,a_fm_p xin
  xout a_am_p*cos(integ(a_fm_p)*2*$M_PI/sr)
endop

Notice that a_am_p and a_fm_p represent the potentially processed
versions of the estimated a_am and a_fm (remember that the names of
Csound's audio variables must start with "a").
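The resynthesis step can be sketched in Python as a running phase sum driving a cosine, scaled by the envelope. This is a minimal illustration, not the authors' implementation: the function name am_fm_resynth and the toy values (sr = 1000, constant envelope and frequency) are assumptions chosen so the result is easy to check by hand.

```python
import math

def am_fm_resynth(am, fm, sr):
    """Rebuild a signal from an envelope (am) and an instantaneous frequency in Hz (fm)."""
    out = []
    phase = 0.0
    for a, f in zip(am, fm):
        phase += f * 2 * math.pi / sr   # running sum, playing the role of integ
        out.append(a * math.cos(phase))
    return out

sr = 1000
am = [0.5] * 20
fm = [100.0] * 20
out = am_fm_resynth(am, fm, sr)

# Halving the instantaneous frequency before resynthesis doubles the period
# of the resulting phasor.
half = am_fm_resynth(am, [f * 0.5 for f in fm], sr)
```

With these values the phase reaches a full turn after 10 samples for the unmodified signal and after 20 samples once the instantaneous frequency is multiplied by 0.5.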


3. SQENVERB: A NEW AM/FM EFFECT 


In our previous papers different families of manipulations were
described and thoroughly explained. For instance, the octIFer [8], a
beautiful sounding octaver-like effect, might be obtained by
multiplying the instantaneous frequency signal by 0.5 [►octifer-half]
or even by 0.25 [►octifer-quarter]. We emphasize, though, that these
manipulations are not directly altering frequencies in the spectrum
of the original signal, but are actually changing the increments that
drive the phasor in the resynthesis process.

Now we describe an effect not yet considered in our previous
studies. The manipulation is based on extracting the square root
of the estimated envelope signal. The analytic signal envelope lies
within the [0,1] range, and considering this interval as our domain
for the square root function, we can affirm that the sqrt will always
return values greater than or equal to the argument. Notice that

$$ \frac{\sqrt{x}}{x} = \frac{1}{\sqrt{x}} \qquad (4) $$

so the closer the argument is to 0, the greater the relative gain will
be. As a consequence, moments of low-intensity sound will be
emphasized, leading to pronounced tails. Although reverberation is
characterized by both early and late reflections [12], it is arguably
more noticeable in the tail of the sound. The effect can thus be seen
as a sort of compressor/expander [13], which in this case acts by
extending a reverberant tail already present in the sound.
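A few numbers illustrate the relative gain in equation (4). The Python sketch below applies the square root to some sample envelope values; the function names and the three envelope values are illustrative assumptions only.

```python
import math

def sqenverb_envelope(env):
    """Apply the square root to an envelope assumed to lie in [0, 1]."""
    return [math.sqrt(e) for e in env]

def relative_gain(e):
    """Gain applied to a non-zero envelope value e: sqrt(e)/e = 1/sqrt(e)."""
    return math.sqrt(e) / e

env = [1.0, 0.25, 0.01]
processed = sqenverb_envelope(env)
gains = [relative_gain(e) for e in env]
```

A full-scale value is left unchanged, a value of 0.25 is doubled, and a value of 0.01 is raised by a factor of about 10, which is why the quiet tail is the part of the sound that grows the most.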

Unlike a regular gain operation, which multiplies the whole signal by
the same amount, the square root results in a selective gain along the
signal duration, directly influencing its decay and thus the perceived
length. Figure 2 compares the Root Mean Square (RMS) of the original
signal [►original] and of the one processed with sqENVerb
[►sqenverb]. The RMS is an audio descriptor related to the perception
of level in a sound.

As we would expect, the spectral information is not considerably
altered by extracting the envelope's square root. In Figure 3 we can
compare the spectral centroid of the original audio and of the sqENVerb
version. The spectral centroid [14] is an audio descriptor related to
the perception of brightness in a sound. Both the RMS and spectral
centroid evaluations were carried out with the Essentia [15] library.

Figure 2: RMS of dry (solid blue) and processed (dashed magenta)
signals. Low-intensity moments are greatly influenced by the square
root operation.



Figure 3: Spectral centroid of dry (solid blue) and processed (dashed
magenta) signals. The operation on the envelope does not considerably
influence the spectral centroid.


4. AM/FM CPU CONSUMPTION 

The CPU consumption assessment is important both for artistic
considerations (e.g. maximum tolerated latencies to avoid difficulties
in musical performance) and for technical reasons (e.g. hardware
sizing). In order to evaluate the computational effort needed to run
the different effects, the time utility (see footnote 2) was used. It
is executed from the shell with the command

time csound amfmdafx.csd 

Here amfmdafx.csd refers to a Csound file with an AM/FM effect
implemented. By default, time returns three measures:

• real: total duration of the process under analysis; 

• user: time taken to work directly on the process; 

• sys: time taken to work on system tasks related to the process. 

The CPU consumption is then given as $\frac{user + sys}{real}$.
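As a worked example of this ratio, the Python sketch below converts the three measures into a percentage. The function name and the timing values are hypothetical and are not taken from the measurements reported in this paper.

```python
def cpu_consumption(real, user, sys_time):
    """CPU consumption in percent: time spent on the process over elapsed time."""
    return 100.0 * (user + sys_time) / real

# Hypothetical readings, in seconds, for a 10-second rendering.
example = cpu_consumption(real=10.0, user=2.3, sys_time=0.2)
```

With these made-up readings the process kept the CPU busy for 2.5 s out of 10 s, that is, a consumption of 25%.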

The results 3 are shown in Table 1, which is divided in several 
parts: 


2 http://manpages.ubuntu.com/manpages/xenial/man1/time.1.html

3 In order to give more meaning to the numbers, the hardware
specifications are: RaspberryPi 2B / quad-core ARM Cortex-A7 @ 900 MHz
32 bits / 1 GB SD-RAM @ 400 MHz / Raspbian / Csound 6.08; ThinkPad
x220 / dual-core i5-2520M @ 2.5 GHz 64 bits / 8 GB RAM DDR3 @ 1333 MHz
/ Debian / Csound 6.09.1. The sample rate considered was always 44100
samples per second.


• in the first part of the table some simple and inexpensive computer
music tasks are evaluated, just to set the scale for the comparisons;

• the second part shows the consumption for performing an FFT and an
inverse FFT, considering different window and hop sizes (shown as
number of samples);

• the performance of classic octaver and reverb implementations is
then shown;

• then the raw AM/FM framework performance is presented
(decomposition followed by resynthesis, with no effects implemented);

• the second to last part shows the performance for some AM/FM
effects explored in [9];

• the last part shows the performance for the octIFer and the
sqENVerb cases.


Table 1: CPU consumption for different types of effects. *The
RaspberryPi could not handle a 5000-sample long convolution reverb.

                               CPU consumption (%)
                           RaspberryPi 2B   ThinkPad x220
looped audio                     9.49            4.69
clip distortion                 10.77            5.06

FFT pair (1024/512)             26.35            8.52
FFT pair (1024/256)             37.38           10.37
FFT pair (1024/128)             45.76           12.31
FFT pair (512/256)              26.05            8.68
FFT pair (512/128)              33.68           10.47

FFT octaver (1024/128)          48.00           12.41
convolution reverb 2500         88.98           14.73
convolution reverb 5000           -*            22.97
simulation reverb               26.13            7.67

AM/FM framework                 25.23            7.53

AM/FM IF filtering              29.72            7.73
AM/FM IF compression            33.21            7.78
AM/FM IF modulation             29.23            7.92

AM/FM octIFer                   29.87            7.81
AM/FM sqENVerb                  28.51            7.77


The FFT algorithm [16] is used widely for the design of audio 
effects, therefore we adopt it here as a benchmark against which we 
can measure the computing costs of the AM/FM framework. From 
the table we can check that both the FFT and AM/FM schemes are 
computationally accessible, and also that the AM/FM framework is 
lighter than the FFT/iFFT in all its cases. Another observation is 
that, in both frameworks, the implementation of a manipulation in 
the alternative domain does not cause a large increase in the CPU 
consumption, in comparison to the case where the raw frameworks 
are applied without any actual effects. 

The octIFer effect delivers a high quality sonority [►octifer-half] 
for a cost considerably lower than the classic contender [►octaver], 
bearing good resemblance in the sonority. 

The sqENVerb effect shows a consumption similar to that of the
simulated reverb case [►simu-reverb], and a huge economy in relation
to the convolution reverb. We emphasize that the 5000-sample impulse
response convolution [►conv-reverb5000] required almost twice as much
CPU as the heaviest FFT case; it was not even possible to run it on
the RaspberryPi, so another IR with 2500 samples was considered
[►conv-reverb2500]. The sonority obtained in this case was not bad,
but such a limitation might be questionable, and even with the short
IR the Raspberry CPU was almost entirely taken. All the tested AM/FM
examples leave considerable CPU headroom, so other effects might be
applied concurrently.

5. CONCLUSIONS 

In this paper we presented, for the first time, a computational
performance assessment of the AM/FM audio effects framework. The new
AM/FM effect sqENVerb was also developed and compared to the
established reverb techniques.

All the examples we explored are based on the non-coherent
mono-component Hilbert Transform case of AM/FM decomposition.
Different techniques for the decomposition are available, and richer
scenarios might also be considered, for instance a filter bank
framework, where the dry signal is separated into bands and the
subsequent decomposition and processing are applied individually to
each band, increasing the computational cost.

The AM/FM decomposition takes the signal to an alternative 
representation, where even subtle modifications in the envelope or 
instantaneous frequency signals might result in deep effects after the 
resynthesis. 

The means by which both the octIFer and the sqENVerb effects 
emulate the octaver and reverb effects might not be orthodox, but 
the sonorities obtained in both cases resemble the classic techniques, 
at a considerably lower computational cost. The octIFer sound is 
quite similar to the classic octaver, and the sqENVerb works fine as 
a reverb, albeit lacking any control besides a dry/wet mix parameter 
(which is actually extremely efficient for tuning a reverb). 

While it is true that powerful computational systems are increasingly
available at decreasing cost, low-consumption algorithms will always
be in demand: draining the battery of devices like tablets or
smartphones with audio effects might not bring a good user experience;
contemporary small single-board computers are still very limited in
processing power; old laptops and netbooks, nowadays usually
discarded, can instead be harnessed as terrific multi-effect pedals.

Plugins for the octIFer and sqENVerb are currently being developed,
to be released as open-source software.


ABSTRACT

This paper presents recent work conducted on the integration of
mass-interaction physical models in the Faust programming language.
After a brief introduction to mass-interaction networks, Faust, and
previous works on this topic, we present a simple modeling framework,
a Faust code generator and its associated library, which allow the
implementation of 1D mass-interaction models. In addition to the
open-source tool itself, this research offers a perspective on
formalizing arbitrarily large networks of bidirectional feedback
couplings and state-space models in Faust, through routing patterns.
We finish with a set of examples, and discuss future perspectives and
challenges.


1. INTRODUCTION 

For several decades, physical modeling has been used to synthesize
audio by means of simulating the behaviour of vibrating objects. A
panoply of methods have been proposed over the years, from lumped
discrete models [1], to Waveguides [2], to large scale Finite
Difference schemes [3], that have gained in popularity with the
increase of computing power. Creating a model of a mechanical
instrumental system can be simpler than explicitly formulating the
signal that it produces (as sound properties emerge from the physical
conditions of the matter) and offers direct means for control and
interaction, either by simulating musical gestures or by coupling the
user and the virtual object, for instance using haptic technologies
[4].

Faust [5] is a functional programming language for real-time
Digital Signal Processing (DSP) with a strong focus on the design of
synthesizers, musical instruments, audio effects, etc. The Faust
compiler can be used to "translate" a Faust program to various
non-domain-specific languages such as C++, C, Java, JavaScript,
LLVM bit code, WebAssembly, etc. Thanks to a wrapping system, code
generated by Faust can be easily compiled into a variety of objects
ranging from audio plug-ins to standalone applications, smartphone
apps, web apps, etc. 1 This mechanism also makes it possible to add
MIDI, OSC, polyphony, etc. support to any Faust-generated program.

1.1. Mass-Interaction Physical Models 

Pioneered in artistic applications by the CORDIS-ANIMA system
[1] at ACROE, mass-interaction physical modeling allows physical
systems to be formulated as lumped networks, composed of two main
components: masses, representing material points in a given space
(1D, 2D, 3D) with a given inertial behaviour, and interactions, each
representing a specific type of physical coupling (e.g.,
visco-elastic, collision, non-linear, etc.) between two mass elements.
Mass-interaction systems are now used in a variety of contexts
(musical & other), partly because arbitrarily complex virtual objects
can be described simply as a construction of elementary physical
components. A basic model is shown in Figure 1.

1 The Faust website contains an exhaustive list of all the Faust
targets: https://faust.grame.fr.










Figure 1: Topological representation of a mass-interaction model. 
Here, a fixed point (represented on the left) is connected to a trian¬ 
gle composed of masses and dampened springs. An input module 
interacts with the top mass through a non-linear pluck interaction. 


Unlike FDTD methods [3], creating physical models with this
formalism avoids the need to explicitly define a mathematical model
(partial difference equations systems, boundary conditions, etc.) for
a given physical structure beforehand. Therefore, it lends itself
particularly well to iterative and exploratory design of "physically
plausible" virtual objects, grounded in the laws of Newtonian physics
but not necessarily limited to the mechanical constraints of the real
world.

Mass-interaction physical models can contain anything from a
couple of physical elements to tens or hundreds of thousands of
them. Assembling and configuring the models element by element
can be very time consuming. To this end, user-friendly modeling
environments have been proposed, namely GENESIS [6] (and more
recently Synth-A-Modeler [7]) for 1D audio applications. The former
offers high level tools for generating topological structures, and
analyzing/tuning physical constructions through modal analysis [8].

1.2. Current State of Physical Modeling in Faust 

Various projects have been using Faust to implement physical models
of musical instruments.

The Faust-STK [9] is a complete re-implementation of the
waveguide and modal models of the Synthesis ToolKit (STK) [10].
It also contains various models from the Soundius Project. 2

2 Unfortunately, there is no documentation/publication on this project yet.

Julius O. Smith implemented a series of waveguide meshes that
landed in the Faust libraries 3 but that were never
documented/published.

More recently, the Faust physical modeling toolkit [11] was
introduced. It is based on a library allowing for the implementation
of bi-directional block diagrams in Faust and containing a wide
range of musical instrument parts that can be assembled in a modular
way. It also comes with mesh2faust [12], a tool to generate modal
physical models compatible with the Faust physical modeling library
using Finite Element Analysis (FEA).

The work presented in this paper was partly inspired by Ed
Berdahl's Synth-A-Modeler [7] (which itself directly draws upon
CORDIS-ANIMA [1] and GENESIS [6]). This environment allows
for the implementation of hybrid models combining mass-interaction
systems with waveguide models using a graphical user interface
(GUI). Synth-A-Modeler is based on a series of Faust libraries and
generates custom Faust programs corresponding to the models
implemented in the GUI. While it successfully combines various types
of modeling techniques at a high level and facilitates their control
using custom haptic interfaces such as the FireFader [13], it has,
to our knowledge, never been used to implement large scale
mass-interaction models.

Our proposed approach does not aim to supplant Berdahl's;
rather, from a similar starting point it questions how the Faust
language's versatility can be used to formalize arbitrarily large
mass-interaction models - and more generally speaking complex feedback
networks - in a direct, concise and clear manner.


3 https://github.com/grame-cncm/faustlibraries


2. MASS INTERACTION PARADIGM IN FAUST

Before getting into implementation specifics, this section presents
the basics of mass-interaction networks, in the case of 1D systems,
in which all masses vibrate along a single z axis. These models are
sometimes referred to as "zero-D", since they are purely topological
and contain no direct geometrical information. First, discrete-time
mass and interaction physical algorithms are presented and assembled
into an explicit computational scheme.

Then, relying on a matrix-based representation of the topological
network, we present a generic Faust architecture that implements
this computational scheme.

2.1. Discrete-Time Physical Algorithms

Below, we present finite difference implementations of two of the
most basic elements in a mass-interaction network: punctual masses
and springs.

2.1.1. Discrete-time implementation of a punctual mass

The motion equation for a continuous-time mass is given by Newton's
second law:

$$ f = m a = m \frac{d^2 x}{dt^2} \qquad (1) $$

where f is the force applied to the mass, m is its inertia, a its
acceleration, and x its position. Applying the second-order central
difference scheme, with the sampling interval noted ΔT, a discrete
equation of the mass can be formulated as follows:

$$ f(t) = m \, \frac{x(t+\Delta T) - 2x(t) + x(t-\Delta T)}{\Delta T^2} \qquad (2) $$

Equation (2) can be normalized to unity, and rearranged in order
to express the mass' position update scheme (discrete-time positions
and forces are noted X and F):

$$ X_{(n+1)} = 2X_{(n)} - X_{(n-1)} + \frac{F_{(n)}}{M} \qquad (3) $$

with M, the discrete-time inertial parameter, defined as:

$$ M = \frac{m}{\Delta T^2} \qquad (4) $$

Hence, the basic discrete-time mass module produces new position
data based on its current position, previous position, the
"discrete-time" mass parameter M, and the sum of forces applied
to the mass from the previous interaction computation step.

The initial position X(0), the delayed initial position X(-1) (which
implies the initial velocity) and the initial force F(0) must be
supplied at the start of the computation.

2.1.2. Discrete-time implementation of a dampened spring

The elastic force applied by a linear spring with a stiffness k and a
resting length of l_0 = 0, connecting a mass m_2 at the position x_2
to a mass m_1 at the position x_1, is given by Hooke's law:

$$ f_{s\,1 \to 2} = -k (x_2 - x_1) \qquad (5) $$

The exact equivalent of this equation in discrete time is:

$$ F_{s\,1 \to 2\,(n)} = -K (X_{2(n)} - X_{1(n)}) \qquad (6) $$

where the discrete-time stiffness parameter K = k. The friction
force applied by a linear damper with a damping parameter z
connecting the same two masses is:

$$ f_{d\,1 \to 2} = -z \, \frac{d(x_2 - x_1)}{dt} \qquad (7) $$

Using the Backward Euler difference scheme, the frictional force
can be formulated in discrete time as:

$$ f_{d\,1 \to 2}(t) = -z \, \frac{(x_2(t) - x_1(t)) - (x_2(t-\Delta T) - x_1(t-\Delta T))}{\Delta T} \qquad (8) $$

which after normalization becomes:

$$ F_{d\,1 \to 2\,(n)} = -Z \left( (X_{2(n)} - X_{2(n-1)}) - (X_{1(n)} - X_{1(n-1)}) \right) \qquad (9) $$

with the discrete-time damping parameter Z defined as:

$$ Z = \frac{z}{\Delta T} \qquad (10) $$

The global equation of the force applied by the dampened spring
is composed of F_s and F_d:

$$ F_{(n)} = -K (X_{2(n)} - X_{1(n)}) - Z \left( (X_{2(n)} - X_{2(n-1)}) - (X_{1(n)} - X_{1(n-1)}) \right) \qquad (11) $$

It is applied symmetrically to each mass (Newton's third law):

$$ F_{2 \to 1\,(n)} = -F_{(n)}, \qquad F_{1 \to 2\,(n)} = +F_{(n)} \qquad (12) $$

2.1.3. Discrete mass - dampened spring - fixed point oscillator

A linear harmonic oscillator is obtained by combining equations (3)
and (11), in the case where X_1 is a fixed point set to X_{1(n)} = 0
for all n ∈ Z. This results in:

$$ X_{(n+1)} = \left( 2 - \frac{K + Z}{M} \right) X_{(n)} + \left( \frac{Z}{M} - 1 \right) X_{(n-1)} + \frac{F_{(n)}}{M} \qquad (13) $$

Since the basic oscillator is a very common element in modeling,
the integrated form given in (13) can be implemented in the form of a
specific mass-type module (although it is identical to assembling a
mass, dampened spring and a fixed point).
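The equivalence between the integrated oscillator and the explicit assembly of a mass, a dampened spring and a fixed point can be checked numerically. The Python sketch below is an illustration only: the function names are hypothetical, and the parameter values (M = 1, K = 0.1, Z = 0.001, a unit initial displacement and no external force) are arbitrary choices, not values prescribed by the derivation.

```python
def modular_oscillator(M, K, Z, x0, x_prev, forces):
    """Mass update (eq. 3) driven by a dampened spring (eq. 11) tied to a fixed point at 0."""
    x, xp = x0, x_prev
    out = []
    for f_ext in forces:
        # The fixed point never moves, so its position and velocity terms are 0.
        f_spring = -K * (x - 0.0) - Z * ((x - xp) - (0.0 - 0.0))
        x_next = 2.0 * x - xp + (f_spring + f_ext) / M
        xp, x = x, x_next
        out.append(x)
    return out

def integrated_oscillator(M, K, Z, x0, x_prev, forces):
    """Same system written as a single recurrence (eq. 13)."""
    x, xp = x0, x_prev
    out = []
    for f_ext in forces:
        x_next = (2.0 - (K + Z) / M) * x + (Z / M - 1.0) * xp + f_ext / M
        xp, x = x, x_next
        out.append(x)
    return out

forces = [0.0] * 200   # free oscillation, no external force
modular = modular_oscillator(1.0, 0.1, 0.001, 1.0, 1.0, forces)
integrated = integrated_oscillator(1.0, 0.1, 0.001, 1.0, 1.0, forces)
max_diff = max(abs(a - b) for a, b in zip(modular, integrated))
```

Both formulations produce the same trajectory up to floating-point rounding, which is the sense in which the integrated module is "identical" to the assembled one.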

2.1.4. Generalization 

Any element in a mass-interaction model follows the basic template
of the elements described above. More complex interactions stem
from conditional statements (e.g. springs only active during
interpenetration of two material points, as in visco-elastic
collisions) or dynamic stiffness or damping parameters that depend on
the position and/or velocity of the connected material points (e.g.
through non-linear lookup tables, such as in plucking or bowing
interactions [14]).

It is important to note that the M and Z parameters are dependent on
the sampling interval. Hence, the oscillatory behaviour of physical
models will be dependent on the sampling rate of the simulation.
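The dependence on the sampling rate can be illustrated with the undamped oscillator (Z = 0), whose recurrence X(n+1) = (2 - K/M) X(n) - X(n-1) admits solutions of the form cos(ωn) with 2 cos ω = 2 - K/M. The Python sketch below converts this angular frequency per sample into Hz; the function name and the parameter values are assumptions made for illustration.

```python
import math

def oscillator_frequency(M, K, sample_rate):
    """Frequency in Hz of the undamped discrete oscillator (requires 0 <= K/M <= 4)."""
    omega = math.acos(1.0 - K / (2.0 * M))   # radians per sample
    return omega * sample_rate / (2.0 * math.pi)

# Same discrete-time parameters, two different simulation rates.
f_44k = oscillator_frequency(1.0, 0.1, 44100)
f_48k = oscillator_frequency(1.0, 0.1, 48000)
```

With identical discrete-time parameters, the oscillation frequency in Hz scales in proportion to the sampling rate, so a model tuned at 44.1 kHz sounds sharper when run at 48 kHz unless M, K and Z are recomputed.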





Figure 2: Computation cycles of the model presented in Figure 1. 
At each time step, the mass-type algorithms are first computed using 
the forces calculated in the previous step, then the interaction-type 
algorithms are computed using the new positions. 


2.2. Computation Scheme

Computing a mass-interaction model consists in calculating the
mass-type and interaction-type algorithms in a closed loop. The
explicit time step increment is carried by the masses, as shown in the
discrete-time equation (3). The interactions in themselves are
delay-less operations, but the loop can be computed since their output
is fed back into the masses for the next calculation step (cf.
Figure 2). In other words, calculating a step of real-time audio
requires running all the masses' algorithms once, then all the
interactions' algorithms.

2.3. Representing the Topological Network 

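The topological connections of a mass-interaction model can be
formalized as a routing matrix of dimensions J × 2K, where J is the
number of material elements (or M points) in the network, and K is
the number of interactions (each interaction module has two
connections, or L points in the usual terminology [1]):

$$
\begin{array}{c|ccccc}
 & l_{0,1} & l_{0,2} & \cdots & l_{k,1} & l_{k,2} \\
\hline
m_1 & 1 \text{ or } 0 & 1 \text{ or } 0 & \cdots & 1 \text{ or } 0 & 1 \text{ or } 0 \\
m_2 & 1 \text{ or } 0 & 1 \text{ or } 0 & \cdots & 1 \text{ or } 0 & 1 \text{ or } 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
m_J & 1 \text{ or } 0 & 1 \text{ or } 0 & \cdots & 1 \text{ or } 0 & 1 \text{ or } 0
\end{array}
\qquad (14)
$$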


As an example, (15) presents the routing matrix for the topological
structure shown in Figure 1. The material elements (fixed point,
three masses and a position input module) are represented vertically
and the L points of the four springs and the non-linear interaction are
represented horizontally.

The closed-loop physical calculation scheme performed by
Faust is shown in Fig. 3. On the left, a LinkToMass connection
function routes the force feedback signals produced by the
interactions based on the routing matrix (thus calculating the sum of
forces for each mass). The new positions of the material element
modules are then calculated. These positions are then fed into a
MassToLink connection function, which routes the signals to all of
the concerned interactions. Finally, the pairs of force signals
produced by the interactions are fed back for the next calculation
step.
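The two routing functions can be sketched in Python from a plain list of connections. This is a hedged illustration, not the generated Faust code: the function names and the LINKS list are assumptions, with the endpoints read off the Figure 1 model (element 0 is the fixed point, 1 to 3 are the masses, 4 is the position input; four springs and one pluck interaction).

```python
# (first element, second element) for each interaction of the Figure 1 model.
LINKS = [(0, 1), (1, 2), (2, 3), (3, 1), (4, 2)]
N_ELEMENTS = 5

def routing_matrix(links, n_elements):
    """Build the J x 2K matrix: one row per material element, two columns per link."""
    matrix = [[0] * (2 * len(links)) for _ in range(n_elements)]
    for k, (a, b) in enumerate(links):
        matrix[a][2 * k] = 1        # first L point of link k
        matrix[b][2 * k + 1] = 1    # second L point of link k
    return matrix

def link_to_mass(matrix, link_forces):
    """Sum, for each material element, the forces of the L points connected to it."""
    return [sum(c * f for c, f in zip(row, link_forces)) for row in matrix]

def mass_to_link(matrix, positions):
    """Send to each L point the position of the element it is connected to."""
    n_cols = len(matrix[0])
    return [sum(matrix[j][c] * positions[j] for j in range(len(matrix)))
            for c in range(n_cols)]

matrix = routing_matrix(LINKS, N_ELEMENTS)
column_sums = [sum(row[c] for row in matrix) for c in range(len(matrix[0]))]
routed_positions = mass_to_link(matrix, [10, 11, 12, 13, 14])
summed_forces = link_to_mass(matrix, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Reading the matrix by rows gives the force sums received by each element, and reading it by columns gives the position sent to each L point; every column holds exactly one connection.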

Position and force inputs are directly incorporated into the
LinkToMass function, so that they are applied to the correct input
module. Similarly, modules whose positions are observed as
audio outputs are simply added as extra signals at the end of the
MassToLink function.

Each column in the routing matrix must have a single connection set
to 1 and all others to 0, as an L point only connects to a single M
point (partially connected interactions are not allowed). On the other
hand, a material point could be connected to any number of
interactions in a given model (many connections set to 1 for a single
line).


2.4. Faust Implementation of Mass and Interaction Elements


The mi.lib library contains the Faust implementation of most
elementary mass-type elements (e.g., masses, fixed points,
oscillators, etc.) and link-type elements (e.g., springs, collisions,
non-linear plucking / bowing, etc.). Since the implementations are
similar, we detail only the two simplest and most common elements
below: the mass and the spring.

Figure 3: Faust-generated diagram corresponding to the model presented in Fig. 1.


2.4.1. Mass 

The discrete-time algorithm of a basic mass module described in
(3) can be easily expressed with the letrec environment expression
in Faust:

mass(m,x0,x1) = equation
with {
    A = 2;
    B = -1;
    C = 1/m;
    equation = x
    letrec{
        'x = A*(x : initState(x0)) +
             B*(x' : initState((x0,x1))) +
             *(C);
    };
};

Listing 1: The discrete-time mass algorithm in FAUST. 

The module takes an input signal (the sum of all forces fed back
through the interaction feedback loop and routing function) and
produces a position output. The initial position and delayed position
of the module are dealt with using the initState function, which
initializes the first step with the correct values.

2.4.2. Dampened spring 

Similarly, a visco-elastic spring expressed in Faust is shown in
Listing 2. Interaction modules such as the spring take two input
signals (the positions of the masses connected together by the link)
and produce two identical and opposite force signals.

Attention must be paid to the correct initialization of velocity
based interactions, especially when the initial position or speed of
the masses is non-zero. To this end, the delayed initial positions
of the two connected mass elements are supplied as arguments to the
interaction function, which initializes them with the initState()
function.

spring(k,z,x1r0,x2r0,x1,x2) =
    k*(x1-x2) +
    z*(
        (x1 - (x1' : initState(x1r0))) -
        (x2 - (x2' : initState(x2r0)))
    )
    <: *(-1),_;

Listing 2: The discrete-time dampened spring algorithm in Faust. 


3. CREATING MODELS WITH MIMS 

Mass Interaction Model Scripter (MIMS) 4 is a simple graphical or
command-line tool written in Python to generate structured Faust
code from a textual description of a physical model.

Models are described in a format similar to the PNSL language
[15]: each physical element has a specific label, specific physical
parameters and/or initial conditions, etc. Parameters can be added
to this description and shared by any number of physical modules,
allowing global variation of the physical attributes (e.g., stiffness,
damping, mass, etc.) of a subset of modules in real-time.

MIMS' physics2faust tool compiles the model by:

• parsing all of the physical modules and noting any specific
elements (i.e., position or force inputs, audio outputs, etc.);

• creating the routing matrix and translating it into the two dual
Faust routing functions;

• ordering the resulting data into the output .dsp file.
"Placeholder" functions are created for position / force inputs,
allowing users to describe their input functions directly in the
Faust code.

4 https://github.com/mi-creative/MIMS

# Define global parameter attributes
@m_K param 0.1
@m_Z param 0.001
@nlK param 0.05
@nlScale param 0.01

# Create material points
@m_s0 ground 0.
@m_m0 mass 1. 0. 0.
@m_m1 mass 1. 0. 0.
@m_m2 mass 1. 0. 0.

# Create and connect interaction modules
@m_r0 spring @m_s0 @m_m0 0.05 0.01
@m_r1 spring @m_m0 @m_m1 m_K m_Z
@m_r2 spring @m_m1 @m_m2 m_K m_Z
@m_r3 spring @m_m2 @m_m0 m_K m_Z

# Inputs and outputs
@in1 posInput 0.
@out1 posOutput @m_m2

# Add plucking interaction
@pick nlPluck @in1 @m_m1 nlK nlScale

Listing 3: MIMS description for the model presented in Fig. 1.
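To give an idea of how such a description can be turned into a routing function, the following Python sketch parses a reduced copy of the model and prints a MassToLink-style definition. It is a toy parser written for illustration and does not reproduce the actual MIMS code: the function names, the MASS_TYPES and LINK_TYPES sets, and the assumption that a link line lists its two endpoints right after its type are all hypothetical.

```python
MODEL = """
@m_s0 ground 0.
@m_m0 mass 1. 0. 0.
@m_m1 mass 1. 0. 0.
@m_m2 mass 1. 0. 0.
@m_r0 spring @m_s0 @m_m0 0.05 0.01
@m_r1 spring @m_m0 @m_m1 m_K m_Z
@m_r2 spring @m_m1 @m_m2 m_K m_Z
@m_r3 spring @m_m2 @m_m0 m_K m_Z
@in1 posInput 0.
@pick nlPluck @in1 @m_m1 nlK nlScale
"""

MASS_TYPES = {"ground", "mass", "posInput"}   # material (M-type) elements
LINK_TYPES = {"spring", "nlPluck"}            # interaction (L-type) elements

def parse(text):
    """Collect material element labels and (endpoint, endpoint) pairs for each link."""
    masses, links = [], []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].startswith("#"):
            continue
        label, kind = tokens[0], tokens[1]
        if kind in MASS_TYPES:
            masses.append(label)
        elif kind in LINK_TYPES:
            links.append((tokens[2], tokens[3]))
    return masses, links

def mass_to_link_definition(masses, links):
    """Emit a routing definition sending each element position to its L points."""
    index = {label: i for i, label in enumerate(masses)}
    args = ", ".join(f"m{i}" for i in range(len(masses)))
    outs = ", ".join(f"m{index[a]}, m{index[b]}" for a, b in links)
    return f"RoutingMassToLink({args}) = {outs};"

masses, links = parse(MODEL)
definition = mass_to_link_definition(masses, links)
```

This sketch deliberately ignores parameters and audio outputs; a complete generator would also append the observed positions and build the dual LinkToMass function.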

The graphical UI version of MIMS also provides basic tools 
for generating certain categories of physical structures (i.e., strings, 
membranes, etc.) and performing modal analysis of linear structures. 



Figure 4: The MIMS model editor prototype. 


The Faust code generated from the model in Code Listing 3 is 
presented in Code Listing 4. The only hand-written element is the 
inPos function, that adds a graphical slider to control the position 
of the input mass. The control-rate output of the slider is smoothed 
to avoid artifacts. 


import("stdfaust.lib");
import("mi.lib");

inPos = hslider("pos",1,-1,1,0.0001) : si.smoo;

OutGain = 10.;

m_K = 0.1;
m_Z = 0.001;
nlK = 0.05;
nlScale = 0.01;

model = (
    RoutingLinkToMass:
    ground(0.),
    mass(1.,0.,0.),
    mass(1.,0.,0.),
    mass(1.,0.,0.),
    posInput(0.) :
    RoutingMassToLink :
    spring(0.05,0.01,0.,0.),
    spring(m_K,m_Z,0.,0.),
    spring(m_K,m_Z,0.,0.),
    spring(m_K,m_Z,0.,0.),
    nlPluck(nlK,nlScale),
    par(i,1,_)
)~par(i,10,_) : par(i,10,!), par(i,1,_)
with {
    RoutingLinkToMass(l0_f1,l0_f2,l1_f1,l1_f2,
        l2_f1,l2_f2,l3_f1,l3_f2,l4_f1,l4_f2,in1)
        = l0_f1, l0_f2+l1_f1+l3_f2, l1_f2+l2_f1+
        l4_f2, l2_f2+l3_f1, l4_f1, in1;
    RoutingMassToLink(m0,m1,m2,m3,m4) = m0, m1,
        m1, m2, m2, m3, m3, m1, m4, m2, m3;
};

process = inPos : model : *(OutGain);

Listing 4: Faust code generated by MIMS for the model presented in Figure 1.


4. EXAMPLES AND EVALUATION 

The basic mi_faust package contains several examples of virtual
instruments and use-cases of mass-interaction physics in Faust. All
of these examples can be compiled and executed directly as web
applications via the Faust online editor, 5 with generic user
interfaces. They can also be found as pre-compiled web-apps on the
mi_faust project web-page. 6

• IPlayTheTriangle: the demonstration model discussed
previously in this paper (Figure 1).

• PolyTriangle: the same model (with a direct force impulse
applied instead of a pluck system), using Faust's ability to
automatically handle polyphonic voice allocation for MIDI
instruments.

• PluckedHarmonics: a 150-mass string terminated by two
fixed points. The first position input allows plucking the
string, and three others are used to press down lightly on the
string at specific areas in order to bring out natural harmonics.

• BowedString: a bowed string, using the nlBow interaction.
The user can control bow pressure and velocity, as well as the
stiffness of the string.

• LargeTriangleMesh: a big triangular mesh, fixed at one
summit, excited by a plucking system and damped by user
input.

• Resonator: the audio input is fed into one end of a resonating
physical model. The user can alter the properties of the
resonator.

• PhysicalLFO: a physical model with slow dynamics used
as a control variable for another synthesis process. Here, the
wave propagation observed along a very loose string is used
to modulate the amplitude of a white noise source, generating
amplitude modulation going from complex patterns at the onset to
quasi-sinusoidal modulation as the higher modes of the string
decay.

5 https://faust.grame.fr/tools/editor/

6 https://faust.grame.fr/community/made-with-faust/mi-faust

In addition to these examples, two large structures (a 20 by
30 mass mesh: 20x30mesh and a 1000 mass string:
1000massString) were created for model complexity tests. The bench
test results in Table 1 show the compile time and CPU load for
various models. Large routing functions result in slower compilation,
and maximum complexity is reached for approx. 1800 physical
elements. Overall, fairly complex models run well, with a reasonable
CPU load.


5. FUTURE WORKS 

5.1. Faust 

Faust proves to be well adapted to implement mass-interaction
physical models. The combination of connection matrices and of
the use of the letrec environment expression allowed us to
seamlessly implement the various elements of mi.lib. However, this
raised some issues that will need to be solved in the future. They are
presented below.

5.1.1. Specifying Initial States in letrec 

The letrec environment expression doesn't allow us to specify an
initial state (i.e., the value of y(n - 1), y(n - 2), etc. at n = 0).
We got around this problem by implementing the initState function,
which requires some unneeded computation. Hence, letrec could
be modified to allow this type of expression to be written (rewriting
Code Listing 1):

equation = x
letrec{
    x' = x0;
    x'' = x1;
    'x = A*x + B*x' + *(C);
};

We believe that this would significantly reduce computation for 
large scale models. 


5.1.2. Optimizing Routing Matrices

The current "bare bone" implementation of connection matrices
(e.g., RoutingLinkToMass in Code Listing 3) is hard for the
Faust compiler to process, preventing large models from being
generated (see §4). This could be solved by turning this operation
into a primitive of the language. Compilation time would be
significantly reduced since pattern matching [5] would no longer be
involved in evaluating this type of expression.

6. CONCLUSIONS 

In this paper, we have presented early results of the formal integration
of 1D mass-interaction physical modeling into the Faust environment,
resulting in a new library. The MIMS and physics2faust tools
make it possible to automatically generate Faust dsp code for complex
topological models, by making the routing scheme for the model's
position and force signals explicit. Several basic models have been
implemented and benchmarked, showing promising results. Furthermore,
Faust's capabilities offer an efficient solution for playing several
dynamically allocated and parameter-mapped instances of a physical
model across large ranges. More generally, this work extends beyond
mass-interaction modeling and explores the possibilities for
describing complex feedback networks and state-space models in Faust.
Model Name          N. Masses   N. Springs   Faust Comp. Dur.   CPU Load
1000massString      1000        1002         -                  -
20x30mesh           598         1151         20.576s            45%
BowedString         150         152          1.962s             14%
IPlayTheTriangle    3           5            0.029s             1%
LargeTriangleMesh   324         901          12.083s            48%
PhysicalLFO         10          12           0.032s             1%
PluckedHarmonics    150         152          2.192s             14%
PolyTriangle        3           5            0.027s             1%
Resonator           30          32           0.056s             4%

Table 1: Number of masses and springs, compilation duration, and CPU load of the examples. Measurements were made on a Lenovo ThinkPad
X1 Carbon with the following configuration: Linux Manjaro, Intel i7-7500U 4 cores at 2.7GHz, 16GiB of RAM, sampling rate of 48KHz, buffer
size of 256 samples. Programs were compiled as ALSA applications with a GTK interface using faust2alsa.
ABSTRACT

In recent years, the number of studies investigating possible
non-invasive health screening techniques for infants has increased
exponentially. Amongst those, one of the most prominent is health
screening based on the acoustic investigation of infant cry.
Clinicians involved in the field moved from visual inspection of the
audible spectrum to automated analysis of cry samples using computer
software. A program that has been widely adopted in recent
years is Praat, free software designed for speech analysis.
Unfortunately, the software's default settings are not suitable for
the investigation of cry samples, yet the settings used are rarely
reported in final manuscripts. In this article, we tested 4 different
computer generated signals, with frequency features comparable to cry
frequencies, and 3 real cry samples using both Praat's standard and
tuned settings. Our results highlight the importance of properly
tuning a software's parameters when expanding its field of usage, and
provide a starting point for the development of an optimal selection
of Praat's algorithm parameters for cry analysis.

1. INTRODUCTION 

Screening of infants' health statuses can lead to early recognition
of developmental pathologies. This allows clinicians to define an
intervention program, which can lead to enhanced outcomes when
adopted in earlier stages of life. Among infants' health screening
methods, non-invasive techniques have received the highest level of
attention within the community of pediatricians and researchers.
Starting from the second half of the twentieth century, researchers
investigated several possible ways to identify different pathologies
and developmental issues through non-invasive methods.

For example, pulse oximetry, a non-invasive technique that measures
the amount of oxygenated and deoxygenated hemoglobin in
blood by means of infrared light, has been widely tested for early
screening of congenital heart defects in asymptomatic newborn
babies [1, 2, 3, 4, 5]. Recently, in a review by Thangaratinam et al. [5],
the authors compared the overall sensitivity and false-positive
ratio of this method against other screening techniques, including
prenatal ultrasounds and routine physical exams [6, 7]. One of the
techniques in which researchers' interest increased exponentially
during the last sixty years is the empirical
analysis of infant cry [8, 9]. Acoustical properties of infant cry
have been associated with different developmental pathologies,
including Autism Spectrum Disorders (ASD), Sudden Infant Death
Syndrome (SIDS), hearing impairments and unilateral cleft lip and
palate (UCLP) [10, 11, 12].

1.1. Properties of Infant cry 

Cry sound utterances are produced by the larynx during the expiratory
phase of respiration. Pressure differences of air streams flowing
through the larynx cause the vocal folds to open and close rapidly,
from about 250 to about 550 times per second in healthy infants
[13, 14, 15, 16, 17, 18, 19]. This rate of vibration is defined as the
fundamental frequency (F0) [20, 21]. The position of the vocal folds is
modulated by the central nervous system (CNS), and therefore activity
of the vocal folds can be used to estimate an infant's developmental
status. Moreover, the lower vocal tract produces different sound
characteristics, including the loudness of the expiratory phase.

The upper vocal folds contribute instead to the production of higher
frequencies, resonances of the fundamental frequency [22, 23]. During
the first two years of life, an infant's body evolves. The vocal tract
takes shape during this period, and therefore the acoustical
properties of cry vocalizations change accordingly [24, 25, 26].

Research studies conducted on infants suffering from pathological
conditions highlighted a positive shift in the spectrum of cry
frequency properties, as compared to those of healthy infants. For
example, investigation of infants at high risk of developing ASD
showed that the fundamental frequency of their cry vocalization
can be higher than 700 Hz [27]. Analogously, F0 values collected from
vocalizations of infants suffering from colic were significantly
higher than those collected from healthy infants [28].


1.2. Cry analysis 

In a typical cry experiment, audio recordings are collected by
inducing infants to cry using a specific paradigm, or trigger (e.g. pain
caused by a heel prick test [29]). Collected samples are then
preprocessed to increase the signal-to-noise ratio.

During the 1960s, when systematic analysis of infant cry began,
researchers relied on visual inspection of spectrograms [30, 31]. With
the advent of more powerful computing devices, techniques and
algorithms employed in cry analysis became more sophisticated,
producing more accurate and useful results.

Because of the similarities between infant vocalization and adult
voice, cry researchers adopted software designed for speech analysis.
One of the programs most widely used within the field is Praat, free
software developed by Paul Boersma and David Weenink, specifically
designed for acoustic analysis of adult voice [32]. In the last 18
years, Praat has been used in 41.3% (N=36) of the articles published
within the field during this period in which detailed information
about the software in use is provided (N=87) [8, 9]. Despite being a
robust tool for speech analysis, Praat's default parameters are not
suitable for accurate analysis of cry samples. In this article we
discuss the role of Praat in cry analysis, highlighting the reasons
for which standard settings are not suitable, and provide suggestions
on how to apply it successfully to cry samples.

2. PRAAT 

Praat features a graphical user interface that fits the needs of
different researchers, from phoneticians to musicians and biologists
involved in the acoustic analysis of animal vocalizations. Written in
C and C++, Praat provides tools for the analysis of pitch (F0)
and formants in audio signals. Praat also comes with a picture
tool which produces high-quality graphics ready to be used in
manuscripts and dissertations.

The software uses a general purpose scripting language that can be
used to automate the analysis of multiple files, allowing for fast
processing of large numbers of auditory samples [32].

Praat implements an auto-correlation algorithm for pitch analysis.
According to Boersma, the applied algorithm is not only more
accurate than other frequency-based pitch detection procedures, but is
also less dependent on the length of the selected window and more
resistant to rapid shifts and external noise [33].

2.1. Praat settings 

In this work, settings have been verified on Praat version 6.0.43 (8 
September 2018), running on a Linux machine (Linux Mint 19 Tara 
x86_64, Kernel: 4.15.0-42-generic). 

2.1.1. Pitch 

Default pitch settings point the algorithm to search for F0 in the
frequency range that goes from 75Hz to 500Hz. As introduced above,
the fundamental frequency of a healthy infant's cry usually lies
between 250 and 550 Hz, with the upper bound being higher in sick
infants.

With those settings, there are at least two possible situations in which 
Praat cannot identify the real fundamental frequency value: 

• F0 is above the upper cutoff. In this situation, Praat will
identify a wrong (lower) value for the fundamental, or provide no
pitch information within a window.

• F0 lies between the cutoff values but a strong noise with a
frequency between 75 and 250Hz is present. When a strong periodic
noise is recorded within the signal, for instance because a
split-system air-conditioner is present within
the recording environment [34], it is possible that the software
identifies this lower frequency as the real fundamental,
especially when this noise is at about half of the real fundamental
frequency.
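
The first failure mode can be reproduced with a deliberately simplified autocorrelation estimator. The sketch below is not Praat's algorithm [33]; it only searches for the lag with the strongest autocorrelation inside the allowed pitch range and resolves near-ties in favour of the shorter lag. The function names, the 12 kHz sampling rate, the 600-sample window and the 0.99 tie threshold are all assumptions chosen for this illustration.

```python
import math


def make_tone(f0, sample_rate, n_samples, n_harmonics=3):
    """Harmonic test tone: F0 plus overtones with 1/k amplitudes."""
    return [
        sum(
            math.sin(2 * math.pi * f0 * (k + 1) * i / sample_rate) / (k + 1)
            for k in range(n_harmonics)
        )
        for i in range(n_samples)
    ]


def estimate_f0(signal, sample_rate, floor_hz, ceiling_hz, window=600):
    """Toy autocorrelation pitch estimate restricted to [floor, ceiling]."""
    min_lag = math.ceil(sample_rate / ceiling_hz)
    max_lag = math.floor(sample_rate / floor_hz)
    if len(signal) < window + max_lag:
        raise ValueError("signal too short for the requested pitch floor")
    scores = {
        lag: sum(signal[i] * signal[i + lag] for i in range(window))
        for lag in range(min_lag, max_lag + 1)
    }
    best = max(scores.values())
    # Prefer the shortest lag that is (almost) as good as the best one.
    for lag in range(min_lag, max_lag + 1):
        if scores[lag] >= 0.99 * best:
            return sample_rate / lag


tone = make_tone(600.0, 12000, 1200)              # F0 above a 500 Hz ceiling
default_estimate = estimate_f0(tone, 12000, 75.0, 500.0)
tuned_estimate = estimate_f0(tone, 12000, 250.0, 800.0)
```

With the 75 to 500 Hz range, the true period is outside the searched lags, so the estimator locks onto a multiple of the period and reports a lower frequency; widening the range to 250 to 800 Hz lets it find the true period.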

2.1.2. Formants 

Standard formant settings are used to obtain up to 5 formants with a
frequency lower than 5500Hz. The GUI returns n - 1 formant
frequency values, where n is the number of formants indicated within
the settings.

3. ANALYSIS OF COMPUTER GENERATED SIGNALS 

To better illustrate pitch and formant extraction errors, we tested
Praat with standard and cry-suitable settings on a set of computer
generated signals with a specific F0, to which white noise was added.
Formants (N=5), with a frequency of about F0 * (n+1) and decreasing
amplitudes [35, p. 306], have been added to the generated signals.
For half of the files, noise in a specific frequency band, close to
F0/2, was added. Four different signals of 5s length have been
generated. The audio files and the Python source code used
to generate those signals are available online¹. The frequency values
used for F0, formants and, where added, F0/2, are reported in Table
1 (Real). To verify the validity of the generated files, a visual
inspection of the spectrum was conducted. Frequency peaks are shown in
parentheses in Table 1.
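
The construction described above can be sketched in a few lines of standard-library Python. This is not the authors' generator (which is available in the repository cited in the footnote); it is a minimal illustration under stated assumptions: the function name `synth_signal`, the 1/(n+1) amplitude law, the Gaussian white noise level, and the modelling of the periodic noise as a single sinusoid at F0/2 are all choices made for this example.

```python
import math
import random


def synth_signal(f0, duration=5.0, sample_rate=44100, n_formants=5,
                 noise_level=0.05, periodic_noise=False, seed=0):
    """Return (samples, component_frequencies) for a cry-like test tone."""
    rng = random.Random(seed)
    n_samples = round(duration * sample_rate)
    # F0 itself, then components at F0 * (n + 1) for n = 1..n_formants.
    freqs = [f0 * (n + 1) for n in range(n_formants + 1)]
    amps = [1.0 / (n + 1) for n in range(n_formants + 1)]  # assumed decay
    if periodic_noise:
        freqs.append(f0 / 2.0)  # assumed: one sinusoid close to F0/2
        amps.append(0.5)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = sum(a * math.sin(2 * math.pi * f * t)
                    for f, a in zip(freqs, amps))
        samples.append(value + rng.gauss(0.0, noise_level))  # white noise
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples], freqs


# Short example: 0.1 s at 8 kHz keeps the pure-Python loop fast.
samples, freqs = synth_signal(400.0, duration=0.1, sample_rate=8000)
```

The returned samples are normalised to a peak of 1 and could be written to a WAV file with the standard `wave` module before being opened in Praat or Audacity.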

Using Praat we extracted the values of pitch and formants at t = 2.5s,
using both Praat's standard (Praat S.) and cry-optimized (Praat O.)
settings. The optimized settings are:

• Pitch:

- Pitch range (Hz) = 250.0 - 800.0

• Formants:

- Maximum formant (Hz) = 4500.0

Fundamental frequencies and formants have also been verified by
visual inspection of the spectrogram using Audacity version 2.2.1 and
the following settings:

• Algorithm: Spectrum 

• Window size: 1024 

• Function: Hanning window 

• Axis: Logarithmic frequency 

Pitch and formant frequencies obtained using the two sets of settings,
and their Mean Absolute Percentage Error (MAPE), are reported in
Table 1.
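
For reference, the MAPE over n frequency values is 100/n times the sum of |actual - estimated| / |actual|. A minimal Python sketch follows; the function name `mape` and the example numbers are illustrative only and are not taken from Table 1.

```python
def mape(actual, estimated):
    """Mean Absolute Percentage Error, in percent."""
    if not actual or len(actual) != len(estimated):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - e) / abs(a) for a, e in zip(actual, estimated))
    return 100.0 * total / len(actual)


# Hypothetical values: an octave error on a 600 Hz fundamental
# (estimated as 300 Hz) alone contributes an error of 50%.
octave_error = mape([600.0], [300.0])
```

Because every term is normalised by the actual value, an octave error on the pitch weighs as much as an error of several hundred hertz on a high formant.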

4. DISCUSSION 

As described above and demonstrated by the analysis of simple
computer generated samples, Praat's default settings are not suitable
for the analysis of infant cry. In example A.wav, F0 is located between
the pitch cutoff values and no periodic noise was added. We can
observe that parameter optimization led to a general improvement of
formant estimation, with the MAPE drastically reduced.

Similarly, in example B.wav, where F0 was still between the pitch
cut-off values and the periodic noise was above the lower cut-off of
the standard settings but not of the optimized ones, the latter
configuration granted a better recognition of the fundamental as well
as of the formants.

¹ https://github.com/ABPLab/Praat-LAC2019

Proceedings of the 17th Linux Audio Conference (LAC-19), CCRMA, Stanford University, USA, March 23-26, 2019

In example C.wav, F0 was higher than the upper pitch cut-off
of Praat's standard settings. Here, pitch recognition identified the
wrong peak as the signal pitch. This situation did not occur when
parameters were optimized and the upper cut-off was increased. This
is especially important when working with pathological infants or
where the risk of developmental pathology is high, and therefore the
acoustic properties of cry are expected to differ from those of healthy
infants.

Finally, as shown with file D.wav, periodic noise at about half of
the fundamental frequency (with a high fundamental frequency)
led the software to a recognition error
even with optimized parameters. This did not happen when the
spectrum was visually inspected, since it was clear that the amplitude
of F0/2 was lower than the amplitude of the peak of F0, as visible in
Figure 1.

Parameter tuning sharpens extracted features, but because of the
properties of cry, researchers still have to pay special attention to
the obtained values, as well as to the quality of collected data.

Generally, we can expect Praat with standard settings to perform
poorly when employed in infant cry studies, because of the
complexity of the signal itself and of the presence of external noise.
In the next section, we show the performance of Praat on real cry
samples, using both the standard and optimized settings.


[Figure 1 plot: "Spectrum of D.wav"; x-axis: Frequency (Hz)]

Figure 1: Spectrum of D.wav, extracted using Audacity. Pitch (F0),
formants (F1, F2, F3, F4) and periodic noise (F0/2) have been labelled
accordingly.


5. ANALYSIS OF REAL CRY SIGNALS

In order to provide a demonstration of Praat's performance on real
cry samples, we analyzed infant utterances from a public dataset
[36]. More specifically, we assessed the first three utterances from
the file "BabyCrying2.wav", hereafter referred to as "Utterance1",
"Utterance2" and "Utterance3".

F0 and formants were first obtained by visual inspection with
Audacity, using the same configuration used to obtain the spectrum
of the computer generated signals. Because of the properties of cry,
the reported values are the mean values over a whole utterance.
Frequency peaks are reported in Table 2. Then, each utterance was
analyzed in Praat, using both the default settings (Praat S.) and our
suggested settings (Praat O.). For each pair of file and settings, we
estimated the Mean Absolute Percentage Error, using as actual value
the peak obtained manually in Audacity by visual inspection of the
spectrum. Pitch and formant frequency values and the MAPE per file
and settings are reported in Table 2.

6. DISCUSSION

As shown in Table 2, the difference in the estimated MAPE of the
investigated samples follows what has been shown for computer generated
signals in Table 1. Similarly to the previous examples, the higher the
formant number, the higher the difference between the peak detected
in Praat and the one found by visual inspection.

With an average reduction in the estimated MAPE of 18.4%, a quick
optimization of pitch and formant detection parameters proved
to be helpful in increasing the accuracy of estimated features. As
demonstrated by our examples, differences in the settings used can
result in a large variance in estimated frequency values. Because of
that, we expect researchers involved in cry studies to tune the
software properly and to report the settings used in final manuscripts.
Unfortunately, this is not the case: details about the settings used
were provided in only 12 of the 36 studies in which the
researchers used Praat [8, 9].


7. CONCLUSIONS

In this work, we demonstrated the different levels of performance
that Praat, an open source software designed for speech analysis,
can achieve on infant cry samples depending on whether or not its
parameters are tuned. In the first part of this work, we generated
different acoustic signals with features similar to those of real cry
samples. Generated files have been analyzed first by visual
inspection, then using Praat's standard settings and finally by fine
tuning the algorithms' parameters. The performance of the software has
been evaluated using the Mean Absolute Percentage Error (MAPE). In
the second part of this work, we applied the same procedure to a
set of real cry utterances. Our results show that Praat's standard
settings are not suitable for the analysis of cry signals, and
therefore the software should not be employed in cry studies without
tuning. Researchers have to carefully examine collected data, to
ensure that no external sources of periodic noise are recorded within
the signals. Furthermore, because of the high inter-individual
variability of cry properties, it may be advisable to tune pitch and
formant extraction settings according to the investigated participants
and their health statuses. We advise researchers in the field to test
Praat's parameters with more complex and extreme cry sounds so as to
identify the extent to which the software can be correctly integrated
in cry studies.
(Praat O.) settings were used and values were computed at t=2.5s. In parentheses are values obtained by visual inspection of the spectrum
been estimated using as Actual value the peak highlighted by Audacity through visual inspection.
ABSTRACT

An acoustic interface (also: hybrid controller) is presented. By
tapping, scratching, rubbing, bowing, etc. on the surface, excitation
signals for digital resonators (waveguides, lumped models, modal
synthesis and sample convolution) are created in synchronicity with
augmenting control signals. It is described how a direct acoustic
excitation delivers an intimate and intuitive interaction. Questions
are raised about which protocols to use for isochronous audio and
control transmission as well as file formats. Standardization of such
protocols is desirable for future hybrid instruments with analog
interfaces. A first step towards standardization is made with the
publication of our implementation.

1. INTRODUCTION 

Recent developments in the musical instrument controller market 
follow the demand for more expressive and continuous control. At 
the same time more computing power allows for expensive synthesis 
methods so that more parameters can be made use of as a continuous 
stream of control data in several degrees of freedom. 

1.1. Keys or silicone? 

A keyboard of the MIDI standard is generally sufficient to generate
the parameters for a simple electronic representation of a piano.
Mod-wheel and pitch-bend only extended this affordance mildly. For
instruments with a continuous articulation, like wind and string
instruments, the single parameter velocity is inadequate. When Yamaha
came out with the CS-80 in 1977 it pioneered after-touch on every
key and laid the foundations for a class of 'extended keyboards'
such as the Haken Continuum [1], McPherson's TouchKeys [2] and
the Seaboard [3] by Roli. All these instruments make multiple
parameters per key available continuously. A standardization effort
for these parameter streams led to the MIDI Polyphonic Expression
(MPE) specification. Jones' Soundplane [4] and Linn's Linnstrument
likewise belong to this group of instruments but do away with the
traditional (and some may say reactionary) piano key layout.

1.2. Exciting audio 

A full audio signal offers even more expression than just
control-rate parameters. Therefore, contact microphones
(piezoelectric sensors) have become a staple of electro-acoustic
exploration. They have also found their way into commercial music
instruments, but mostly as cheap threshold trigger pads delivering way
below their potential. Only a handful of commercially available
instruments, namely Korg's Wavedrum (1994), Zamborlin's Mogees [5]
(2014) and the ATV aframe [6] (2017), have put them to much more
adequate use by feeding the excitation signal into a digital
resonator. In the context of research, a variety of implementations
for experimental and affordable instruments with acoustic interfaces
have been proposed, from ceramic tiles as a source for percussive
sounds [7], to acrylic sheets instead of guitar strings [8], [9], to
intricate prototypes with vibration insulated pads for eight fingers
[10].

1.3. Marrying control and exciter 

Miller's tiles [7] and Momeni's Caress [10] consider the processing
of the contact microphone as sufficiently expressive. Cook's
Nukulele [11] combines two sensors, one at audio rate and one at
control rate, to create the affordance of a ukulele which is played
with both hands on different positions of the instrument. As one
would with a guitar, one hand controls the parameters while the other
provides an excitation signal. The former is the control rate input
and the latter the audio rate input.

The Kazumi by Zayas is an instrument which combines capacitive
sensing and piezoelectric microphones on the same surface [12].
It features seven separate faces in a prismatic heptagonal shape. Each
of the faces has a copper capacitive sensing layer which divides it
into six areas from bottom to tip, combined with a piezo mic
underneath.

We want to augment the sound signal with additional parameters,
so we simultaneously track the position of touch on the surface.
This way we make a second hand for generating parameters
obsolete (Figure 1). Our implementation creates a percussive instrument
which can be hit, but which can also be melodic and played in
continuous gestures by rubbing, scratching, or bowing on its edge.



Figure 1: Hybridity of audio and control data 


1.4. Instrument versus controller 

Great effort has been put into abstracting controller hardware to
become universal input devices for software instruments. The generic
controller is an interface to change parameters on the synthesizer in
which the actual sound is generated. In our instrument it's not so
easy to define where the controller ends and the instrument starts.
Cook writes that "...many of the striking lessons from our history
of intimate expressive musical instruments lie in the blurred
boundaries between player, controller, and sound producing object." [11].
In our instrument we are using the actual audio signal from the
surface, which is then fed into a digital filter on the computer. In
effect, a significant component of the final sound is defined by the
spectrum and gesture of the excitation signal. While the term
'hybrid controller' is found in the literature [9], we prefer to
describe the Tickle as an 'acoustic interface'. In our opinion
'hybridity' is too generic and there is no declaration of its
components, while 'acoustic interface' adds clarity to its nature.

2. THE TICKLE 

The following section describes the components of the instrument. 

2.1. Hardware 

The case is made of bent steel with wooden side panels. Its top
surface is a printed circuit board and has a capacitive touchpad, three
endless rotary encoders with associated RGB LED and up/down
buttons (for transposition or other parameters). On the back are six
ports:

1. External in (if plugged-in it mutes the built-in sensor) 

2. CV out Y axis (0-4 V) 

3. CV out X axis (0-4 V) or note 

4. Host (micro-USB port) 

5. Gate or envelope (0-5 V) 

6. Excitation (audio signal) 

2.2. Surface 

After a brief evaluation of piano key layouts and variations thereof
[13] it was concluded that a piano key layout is contradictory to
the intended interaction with the instrument. A hexagon pattern
was chosen to obtain an equally spaced and equally sized¹ segmentation
without empty spaces on the surface. It is also found in other
electronic instruments and controllers, for example, the Snyderphonics
Manta [14]. From the 8-bit resolution in the X and Y axes we can
calculate in which of the 14 hexagons printed on the surface a touch
occurred. The capacitive touch sensing is single-touch, so polyphony
cannot be achieved by simultaneous touches. A gesture with two or more
points will produce erroneous ghost touch points and thus needs to be
avoided while playing. However, with voice allocation we can let
one touch resonate while a new touch gets its own resonator, so
subsequent touch events may have overlapping resonances.
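
One simple way to map an (X, Y) reading to a hexagon is a nearest-centre lookup, since on a regular hexagonal tiling the closest centre identifies the containing cell. The Python sketch below illustrates this under loudly stated assumptions: the 5-4-5 row arrangement, the hexagon radius of 29 units and the function names are hypothetical and do not describe the Tickle's actual layout or firmware.

```python
import math


def hex_centers(row_lengths=(5, 4, 5), radius=29.0):
    """Centres of pointy-top hexagons on a hypothetical 5-4-5 layout."""
    width = math.sqrt(3.0) * radius          # centre-to-centre spacing in X
    longest = max(row_lengths)
    centers = []
    for row, count in enumerate(row_lengths):
        # Shorter rows are shifted so that the pattern interlocks.
        x_offset = width / 2.0 + (longest - count) * width / 2.0
        y = radius + row * 1.5 * radius
        for col in range(count):
            centers.append((x_offset + col * width, y))
    return centers


def hexagon_at(x, y, centers):
    """Index of the hexagon whose centre is closest to the touch (x, y)."""
    return min(
        range(len(centers)),
        key=lambda i: (centers[i][0] - x) ** 2 + (centers[i][1] - y) ** 2,
    )


CENTERS = hex_centers()                      # 14 cells within 0..255
touched = hexagon_at(120, 70, CENTERS)       # raw 8-bit X/Y reading
```

The resulting index could then drive voice allocation, for example by selecting the pitch of the resonator assigned to that touch.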

2.3. Material and Texture 

To create an acoustic excitation signal we rely on a hard material that 
captures the spectra of different gestures. In addition to the rigidity 
of the material, a textured surface is essential to create enough noise 
when rubbed and wiped. Silicone surfaces are not suitable for our 
application since they absorb too much of the subtle interaction. 

2.4. Residual and Resonance 

Generally, we want the physical surface of the instrument to resonate
as little as possible, so that we can feed the dry residual signal of the
touch gesture (rub, scratch, hit, flick, bow etc.) as excitation
signal into a digital resonator (see also [7]). This way the full power
of physical modeling synthesis algorithms may be accessed. The
practice of sending generated noise-bursts or clicks into digital
resonators, which can be found in literature for physical modeling and
which is still the standard in many soft- and hardware
implementations, is crippling the true potential of such algorithms.

¹ except for the hexagons at the edges

2.5. Synthesis 

For the sound synthesis we employ techniques of digital reverberators
which at their heart are delay lines, feedback and filters. They can
be understood as modeled simulations (waveguides and mass-spring
models) of the physics happening in real instruments as described
by Smith [15]. These models can be generated with Berdahl and
Smith's Synth-A-Modeler compiler [16], which has received a graphical
interface with Vasil's SaM-Designer [17]. Synth-A-Modeler generates
Faust code which can be compiled into a variety of other formats
such as a Pure Data external. With the Pure Data object pmpd~
from Henry's PMPD [18] library, which creates static mass and spring
models, we achieved nice sounding string, plate, and gong
topologies. However, we are not aiming for perfect recreations of
classic instruments; our interest lies in the exploration of synthetic
sounds with an acoustic and intimate level of control. Algorithms such
as the nested comb filter delay as described by Ahn and Dudas [19]
prove interesting and fun to interpret with our instrument while being
surprisingly cheap to compute. We can employ our acoustic interface to
excite extended, hybrid and abstract cyberinstruments as described
by Kojs et al. [20]. Convolution methods with samples can be useful
to digital Foley artists to articulate a sample in a plenitude of
variations.
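
As a minimal illustration of "delay line plus feedback", the Python sketch below implements a plain feedback comb filter, y[n] = x[n] + g * y[n - D], driven by an arbitrary excitation signal. It is a generic textbook resonator, not the synthesis engine used with the Tickle; the function name and parameters are assumptions for this example, and a practical resonator would add a filter in the feedback path.

```python
def comb_resonator(excitation, delay, feedback, n_out=None):
    """Feedback comb filter: y[n] = x[n] + feedback * y[n - delay]."""
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    if not -1.0 < feedback < 1.0:
        raise ValueError("feedback must lie in (-1, 1) for a decaying tail")
    n_out = len(excitation) if n_out is None else n_out
    y = []
    for n in range(n_out):
        x = excitation[n] if n < len(excitation) else 0.0
        delayed = y[n - delay] if n >= delay else 0.0
        y.append(x + feedback * delayed)
    return y


# A single click rings on as a decaying train of echoes `delay` samples apart.
ringing = comb_resonator([1.0], delay=4, feedback=0.5, n_out=12)
```

The resonance frequency is the sampling rate divided by the delay (a delay of 109 samples at 48 kHz gives roughly 440 Hz), and feeding in a recorded scratch or rub instead of a click lets the gesture's own spectrum shape the result.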



2.6. Software Architecture and Code 

Our hardware is based on a Cypress PSoC 5 microprocessor and runs
a firmware which digitizes the capacitive sensing surface and the
signal from the piezoelectric sensor. It communicates with a custom
kernel driver which in turn communicates with user-space software
like our Pure Data external or a VST-plugin. Our kernel driver for
Linux as well as the Pure Data external are published under a free
license. A repository of the source² is available (mirrored on github³).

2.7. Drivers and Communication 

A great challenge was to transmit control rate signals married to a
stream of audio with a stable latency and reliable offset to each other.
The capacitive sensing reports a position every 4 ms while the audio
streams with a sample rate of 48 kHz and a block size of 64 samples.
Currently the user-space software is expected to match these settings
to work reliably. We wrote our own Linux kernel driver receiving
this isochronous stream of control and audio rate signals via USB
from the device.
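
The figures above imply a fixed ratio between touch reports and audio blocks, which the following short Python calculation makes explicit. The function name is an assumption; the numbers are the ones stated in the text.

```python
def touch_timing(sample_rate_hz=48_000, block_size=64, touch_interval_ms=4):
    """Relate the touch report interval to audio samples and blocks."""
    samples_per_report = sample_rate_hz * touch_interval_ms // 1000
    blocks_per_report = samples_per_report / block_size
    block_duration_ms = 1000 * block_size / sample_rate_hz
    return samples_per_report, blocks_per_report, block_duration_ms


samples_per_report, blocks_per_report, block_duration_ms = touch_timing()
# 4 ms at 48 kHz is 192 samples, i.e. exactly 3 blocks of 64 samples,
# and each block lasts about 1.33 ms.
```

Because one report spans a whole number of blocks at these settings, a position update can always be aligned with a block boundary, which is one reason the user-space software has to match them.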

3. STANDARDS FOR TRANSMISSION AND STORAGE 

We believe that acoustic interfaces will soon become a category of
their own and manufacturers will introduce hybrid controllers to the
market. To make these new devices work with synthesis software
there will have to be a standardization effort for interoperability.
McMillen and Thew published a proposal on how to send sound
spectrum information over MIDI and OSC [21]. However, many
questions are yet to be answered about which format and standard should
be used for audio and data. A plethora of further questions arise
when thinking about a possible integration of a track with control
and audio-as-synthesis-source into a DAW. With this publication and
the open source driver we wish to start a discussion about possible
open standards for transmission, storage, and integration of analogue
interfaces into the creative workflow of musicians.

3.1. Specifications for the Driver 

Our aim is an isochronous transfer of data and audio rate signals
with minimal latency, and more importantly, with little jitter [22].
The touch position data needs to be present before the audio arrives
to be able to tune the synthesis. There can't be any variation in the
offset between signal and data. The audio stream doesn't need to be
continuous; it could start on the touch event and end with it. In a
future polyphonic version, several audio streams could exist in parallel.
The implementation could be a data protocol with (multichannel)
audio streaming segments on demand, or a continuous audio
stream with additional data interwoven. The touch events should
refer to a specific sample in the audio, possibly with a timestamp.
Other interface data like extra knobs, faders, potentiometers or
rotary encoders don't need this precision in timing.
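
To illustrate what "touch events that refer to a specific sample" could look like on the wire, the Python sketch below packs a touch event together with the index of the audio sample it belongs to. The 7-byte little-endian layout, the field names and the `TouchEvent` class are purely hypothetical and are not the format used by our driver; they only show one possible shape for such a record.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: uint32 sample index, uint8 x, uint8 y, uint8 flags.
TOUCH_EVENT = struct.Struct("<IBBB")


@dataclass(frozen=True)
class TouchEvent:
    sample_index: int   # audio sample this touch position refers to
    x: int              # 8-bit X position
    y: int              # 8-bit Y position
    touching: bool      # True while the finger is on the surface

    def pack(self) -> bytes:
        return TOUCH_EVENT.pack(self.sample_index, self.x, self.y,
                                int(self.touching))

    @classmethod
    def unpack(cls, data: bytes) -> "TouchEvent":
        sample_index, x, y, flags = TOUCH_EVENT.unpack(data)
        return cls(sample_index, x, y, bool(flags & 1))


event = TouchEvent(sample_index=192, x=10, y=250, touching=True)
restored = TouchEvent.unpack(event.pack())
```

Tying each event to a sample index rather than to an arrival time keeps the offset between control and audio fixed even if the two travel with different transport delays.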

3.2. Surveyed Communication Protocols 

We’ve considered different established and experimental protocols. 
Each was evaluated against the aforementioned goals. 

1. A kernel module driver was our choice, as it gives us the 
maximum amount of control to make sure it meets our crite¬ 
ria. However, it needs an installation procedure. On Windows 
and Mac OS the operating system vendor restricts who can 
distribute kernel modules, in fact we have paid Apple and ap¬ 
plied for kernel signing and are still waiting for any response 
after 5 months. On Linux, Secure Boot needs to be deac¬ 
tivated or the kernel extension manually signed. A custom 

^Source code: https://gitlab.chair.audio/explore/projects 
3 Github mirror: https://github.com/chairaudio 


kernel driver means additional development overhead and for 
the customer the fear that the device will be rendered useless 
if support ends. 

2. Audio spectrum data (via MIDI or OSC). Another approach
would be to break down the audio into metadata and then send
this over established protocols like MIDI or OSC, which would
allow for a partial reconstruction. This was proposed in the
aforementioned draft by McMillen [21]. We dismissed this
approach because we see it as necessary to include a full audio
stream in order to avoid the latency that the analysis of such
descriptive meta information requires. It also creates a
computational overhead on both the sending and the receiving
device.

3. Audio and MIDI Class Compliant drivers are a viable
alternative. It’s possible to use one USB connection providing
two virtual devices: an audio interface and a HID or MIDI
device. Using standards means compatibility, no driver installs
and continuous support. However, it’s not guaranteed that
latency and offset will be consistent. Another problem lies in
the limitations of popular proprietary DAWs like Ableton Live,
which only allows the use of one sound card at a time. Assuming
that the sound synthesis happens in a plugin of the DAW, this
restriction would prevent the plugin from accessing the audio
device.

4. Control Data as Audio Signal. Control data may be sent as
signals at audio rate, not unlike control voltage in synthesizers
or upsampled sensor output in Wessel’s Slabs, which features
96 channels of audio [23]. It could also be encoded as
frequencies and later be decoded with a Fourier transform, as
in the Nuance described in Michon et al. [24].

5. MIDI 2.0. There is no indication that MIDI 2.0, which is
currently in the prototyping stage at the MIDI Manufacturers
Association, will include a feature for sending audio streams
for acoustic interfaces.

This list is not exhaustive; for example, we have not surveyed
protocols like Ultranet or AVB. It’s likely we have overlooked
something, and there may be a sensible solution to our problem
already available.
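Option 4 in the list above mentions encoding control data as
frequencies and decoding it with a Fourier transform. The following
is a minimal sketch of that idea only, not the method used by any of
the cited devices; the sample rate, window length and frequency range
are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8_000            # assumed sample rate in Hz
N = 256                        # analysis window length in samples
F_MIN, F_MAX = 500.0, 3000.0   # assumed frequency range carrying the value


def encode(value: float) -> list:
    """Encode a control value in [0, 1] as one window of a sine tone."""
    freq = F_MIN + value * (F_MAX - F_MIN)
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n in range(N)]


def decode(window: list) -> float:
    """Recover the control value from the strongest DFT bin."""
    best_bin, best_mag = 0, 0.0
    for k in range(1, N // 2):
        # Naive DFT of bin k; an FFT would be used in practice.
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / N)
                  for n, x in enumerate(window))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    freq = best_bin * SAMPLE_RATE / N
    return (freq - F_MIN) / (F_MAX - F_MIN)
```

With these assumed numbers the bin spacing is 31.25 Hz, so the value
is recovered in steps of about 0.0125, and each reading needs a full
window of 32 ms. The sketch therefore also shows why this option
trades timing precision for resolution.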

4. FUTURE WORK 

Future research may be conducted to add the following features
to the instrument: 1. Multi-touch, to avoid ghosting issues when
two fingers touch the surface simultaneously. It would also allow
for polyphony later on. 2. Pressure sensing [25], either for every
point or at least globally for the whole surface. 3. Haptic
feedback, which is challenging to implement due to the feedback
into the sensor, but can give the user a much more intense sense
of reality. The Lofelt Basslet [26] is a good demonstration of
such a device. 4. Integrated sound synthesis, implemented either
by a) analog circuitry or b) an embedded computing platform, for
example the Bela board [27]. 5. Playful interfaces to manipulate
mass-spring models in real time, as seen in Allen’s Ruratae [28].
Figure 3: One of the more unconventional and unintended ways to
play the Tickle 


5. CONCLUSIONS 

Our instrument Tickle combines several well-known techniques and 
technologies which on their own are not new. Touch pad, contact 
microphone, and physical modeling synthesis have been around for 
decades. However, in their combination they synergize to a power¬ 
ful intuitive instrument which allows for a natural and intimate [29] 
interaction with precise and reproducible control over sound.
Feeding an analogue excitation signal into a (digital) resonator
can create familiar as well as alien sounds: sounds which either
behave like instruments we know (violin, guitar, snare drum,
cymbal, gong, marimba, etc.) or sounds which are distinctly
synthetic but have an analogue touch to them. 4

With this paper we hope to have shown the necessity of
sample-accurate, low-latency and jitter-free communication for
acoustic interfaces, and to have started a discussion on how to
achieve it.
ABSTRACT

This paper presents an ongoing project focused on the co-design and 
co-creation of a small orchestra of digitally fabricated digital mu¬ 
sical instruments (DMIs) based on the Bela board, an open-source 
embedded computing platform. The project took place in Fab Labs, 
an international network of digital fabrication laboratories 1. The
orchestra, named Game|Lan, is inspired by the traditional Indonesian
Gamelan ensembles, their music and philosophy. The project aims 
to explore the capabilities of the Fab Lab network which runs on an 
open-access, open-source and open-hardware ethos, for a distributed 
project of this type. The aspiration is to create an original
orchestra for non-musicians which offers the rich collective
experience of being in a music group, and to explore it as a
medium for social interaction. This paper presents the first
results of the research project, which took place in three Fab
Labs in South America, and focuses on the process and the
development of the project.

1. INTRODUCTION 

In the last two decades, a large number of digital musical
instruments have been developed by the sound and music computing
community [1], [2]. The International Conference on New Interfaces
for Musical Expression 2 annually hosts numerous music technology
research projects related to musical expression and to digital
luthiers. However, very few projects are designed and made using
participatory methods and techniques. The Input Devices and Music
Interaction Laboratory at McGill University has co-developed the
McGill Digital Orchestra, which involved collaboration between
researchers, composers and performers. More recently, the
Augmented Instruments Laboratory at Queen Mary University of
London has started developing a research trajectory related to
participatory design and co-design of digital musical instruments
[3], [4].

This paper presents the process of developing a digital musical
instrument with participatory design and creation methods:
brainstorming sessions, workshops, hands-on experimentation, etc.
A different approach was adopted for each stage of the project,
depending on the resources and research area of each Lab. Equal
focus was given to the physical body of the instrument and to its
electronic and digital components, for which an embedded computing
platform for low-latency audio was used and programmed. The sound
synthesis algorithms were designed and developed in an iterative
process; it was not possible to employ true participatory
techniques in this case, as the participants did not have the
necessary experience or skills in music signal processing.

1 https://fablabs.io/labs/map 

2 http://www.nime.org/ 
Section 2 gives an overview of the open design, co-design and
co-creation culture and of the Fab Lab network. Section 3 presents
the concept behind this project and outlines the basic idea behind
the orchestra and the requirements and constraints of the approach.
Finally, Section 4 focuses on the design and the making of the
instrument during the residencies that the authors had in three Fab
Labs in South America.

2. CO-DESIGN AND DIGITAL FABRICATION 

2.1. Fab Labs 

In recent years the maker movement has emerged, in part because of
people’s need to engage passionately with objects in ways that make
them more than consumers [5], [6]. In particular, the Digital
Fabrication Laboratories, so-called Fab Labs, form part of a larger
“maker movement” of high-tech do-it-yourselfers who are
democratising access to the modern means to make things [7], [8].

Fab Labs are often seen as open-innovation contexts in which 
lead users can develop innovation that may become commercial so¬ 
lutions from which companies can profit. But they may also be seen 
as platforms for broader participation and new ways of collaborative 
engagement in design and innovation, pointing at alternative forms 
of user-driven production [9]. 

Fab Labs were chosen over other types of makerspaces because
collaboration between Labs is central to the philosophy of the Fab
Lab Network. The fact that each Fab Lab has to share the same
machines and processes allows information, projects and people to
move freely between them. Also, fabricating the instrument with the
principles and practices of a Fab Lab means that anyone can download
the open designs, customise them if they need to and fabricate them
in any Fab Lab around the world.

2.2. Co-Design and Co-Creation 

Co-design is used as an umbrella term for participatory design and
collaborative design. Participatory Design, seen as design of
Things, has its roots in the movements toward democratisation of
workplaces in the Scandinavian countries. In the 1970s participation
and joint decision-making became important factors in relation to
workplaces and the introduction of new technology [10]. Co-design
breaks the rules of the traditional designer-client relationship
and allows for creative contribution to design decisions. Without
excluding the designers from the process, it recognises the
important role of the users' participation in the design decisions,
as experience experts. This research uses the method of
participatory design, a human-centered design approach that attempts
to involve users and experts in the design process in order to
ensure the usability of the product design [11]. The authors have
applied and adapted the Participatory Design methods in the Fab Lab
environment depending on each user group. Participatory research
methods [12], [13] that involve hands-on processes and Fab Lab
principles both take the same approach of testing feasibility in all
stages of work. The authors followed the five-stage design thinking
model proposed by the Hasso Plattner Institute of Design at Stanford
(d.school). The research was therefore conducted in five steps:
empathise, define, ideate, prototype, test 3 4 . For the first two
steps a mind map was drawn on a whiteboard, as a qualitative data
collection tool for generating ideas.

3. CONCEPT 

The concept of the project was to co-design and co-fabricate locally
a series of elegant and simple-to-use embedded digital musical
instruments for non-musicians. The aim is to create a small
orchestra similar in philosophy to the Gamelan Orchestra [14] and
explore it as a medium for social interaction. The percussion-type
instruments would be plug-and-play and easy to perform creatively
without necessarily any musical background. It is worth mentioning
that most Gamelan ensembles, especially in the UK, allow people of
all ages and abilities to take part. Both authors of the paper were
part of the Cardiff Gamelan ensemble and found this fact very
inspiring; it eventually became one of the main reasons to approach
the Game|Lan project orchestra in a similar way 3 4 5 .

A very important aspect of the project was its participatory char¬ 
acter and ethos. The instruments had to be co-designed and co¬ 
created locally, in Fab Labs. Each Fab Lab with its particular focus, 
skills and expertise, would contribute to the project accordingly. The 
authors planned to visit three to four Fab Labs in South America and 
work for a short period of approximately one week with the makers, 
engineers, entrepreneurs and designers in their premises. 

Moreover, it is worth mentioning that this is a mobile project and
follows the authors’ idea of "how to make almost anything while
travelling". The authors wanted to test how feasible it is to do
creative work while travelling, following a digital nomad
lifestyle 6 . Every single destination would serve as a source of
inspiration and every Lab would contribute uniquely to the
realisation of the project. Ideally each Lab would develop its own
instrument, aligned to its local culture and geographical location.
This idea proved to be too ambitious for the time spent in each Lab
and, although many prototypes were fabricated in each place, only
one final instrument was produced, at the very end of the trip.

In terms of materials and techniques, the project had to be
digitally fabricated, with open design files and with the machines
and technologies shared within the Fab Lab network: 3D printers, CNC
machines, laser cutters, high-resolution milling machines for
printed circuit board milling, electronics and microprocessors.
Since the majority of the Fab Labs do not focus on DMIs, the authors
had to provide the necessary embedded computing platforms for the
development of low-latency audio applications. For that reason, the
Bela board, an open-source embedded computing platform, was chosen
together with the Pure Data visual programming language [15].
Alternative platforms more widespread in the Fab Lab community, such
as the Arduino with the ATMega328 chip or the ATtiny
microcontroller, were not appropriate even with extra boards to
support audio input and output. The Raspberry Pi could be an
alternative but it would also need other peripherals [16].

3 http://www.nime.org/

4 https://www.interaction-design.org/literature/article/5-stages-in-the-design-thinking-process

5 http://artsactive.org.uk/2018/02/09/cardiff-gamelan-community-group/

6 https://nomadlist.com/

For the sound creation component of the instrument, the intention
was to design and develop a simple sound synthesis system which
would generate timbres and sequenced music material that would be
mapped intuitively to the physical interface. Since the performers
wouldn't be musicians, it was important to make it easy for them to
create rich musical output with simple gestures.

4. PROCESS 

The co-design and co-fabrication sessions of the project were
carried out in three Fab Labs in South America: the Fab Lab at the
University of Chile in Santiago, Fab Lab Lima in Peru and the Fab
Lab of the National University of Colombia in Medellin. It is worth
noting that these three sessions were very different in nature and
approach. Furthermore, the participants were not researchers from
the DMI community, nor were they professional instrument players or
digital luthiers, but mainly active members of the Fab Lab network
and the Maker movement. That was not necessarily a complication in
the co-creation process since the instrument was aimed at this type
of performer. Below, we present chronologically how each Fab Lab
contributed to the project and how the authors approached the
collaboration with the teams in each location.

Fab Lab - University of Chile 

Fab Lab U. de Chile 7 is housed in the Engineering School of
Universidad de Chile in Santiago. The Fab Lab quickly embraced the
Game|Lan project idea and invited us to work with three members of
their core team to discuss our ideas on the physical and digital
interaction, form, fabrication method and electronic design.

After having presented the idea and discussed the available
resources, the authors collected the information from the mind maps
and started drawing out all important points, as discussed with the
team, onto a whiteboard (see Figure 1). The points proved to be our
compass for agreeing on a good size, form and interaction; these
decisions were made collectively. Figure 1 shows how the team
defined some parameters that would be followed throughout the
project. It was equally important to embed in the project the Fab
Lab ethos, the mobile nature of the instrument, the electronics
restrictions, the aesthetics and the Gamelan philosophy.

Following the research and the decisions taken by the team, the
instrument had an approximate size of 250 x 150 x 150 mm, with an
enclosure that would fit the microcontroller, battery and sensors.
The first prototype was finished on day three and from then on we
could easily test the interaction. The team decided that different
faces would allow for a certain tilting of the instrument, which
would work well with the physical and digital interaction.

The physical structure of the musical instrument embedded sensors,
very simple signal conditioning circuits and a small single-board
computer for audio and sensor signal processing. The Fab Lab
community commonly uses the Arduino board or directly its ATMEGA
single-chip microcontroller, which unfortunately does not allow
on-board audio processing. As mentioned in Section 3, the Bela board
was chosen for its audio specifications and because it is very well
integrated with Pure Data, a very well-known open-source programming
language for computer music applications that is aligned with the
open-source philosophy of the Fab Lab community.

7 http://www.fablab.uchile.cl/

Figure 1: Points to be considered during design decisions

In our prototypes in Santiago, the team used a two-axis
accelerometer, a piezoelectric sensor and three reed switches. An
algorithm was developed to detect the active face of the polyhedron
from the readings of the accelerometer and to influence the signal
processing algorithms accordingly. The piezoelectric sensor measured
pressure on the faces of the instrument, and its signal was used
either as an audio input or as a trigger for samples. The reed
switches and the three magnets acted as a 3-bit digital input signal
that affected the settings of the instrument. All these electronic
components were soldered onto a perforated board. An electronic
engineer from the local team helped with the electronic development
and started programming in Pure Data for the first time.
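The face detection algorithm itself is not described in the text (it
was implemented in Pure Data). One plausible approach, sketched here
in Python purely for illustration, is to compare each accelerometer
reading with readings recorded while the instrument rests on each
face. The calibration values, the face numbering and the distance
threshold are all hypothetical.

```python
import math

# Hypothetical calibration table: two-axis accelerometer reading (x, y),
# in units of g, recorded while the instrument rests on each face.
FACE_REFERENCES = {
    0: (0.0, 1.0),
    1: (0.87, 0.5),
    2: (0.87, -0.5),
    3: (0.0, -1.0),
    4: (-0.87, -0.5),
    5: (-0.87, 0.5),
}


def active_face(ax, ay, references=FACE_REFERENCES, max_distance=0.4):
    """Return the face whose calibrated reading is closest to (ax, ay).

    Returns None when no reference is within max_distance, for example
    while the instrument is being moved between faces.
    """
    face, distance = min(
        ((f, math.hypot(ax - rx, ay - ry)) for f, (rx, ry) in references.items()),
        key=lambda pair: pair[1],
    )
    return face if distance <= max_distance else None
```

Returning None for ambiguous readings is a design choice of this
sketch: it lets the synthesis keep its previous setting instead of
flickering between faces during a tilt.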

One of the concepts that the team developed in Santiago was to have
an ensemble of at most eight reconfigurable, modular and
interchangeable instruments. During the music performance, the
players would mix the top and bottom parts of their instruments in
order to increase the dramaturgy and the physicality of the
performance. This gesture would change the settings of the
instrument, such as the timbre family or the sequenced music
patterns triggered by the performers. The reed switches mentioned
above were used for that purpose.
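Reading the three reed switches as a 3-bit value can be sketched as
follows. The bit order and the function name are assumptions; the
text only states that the switches act as a 3-bit digital input that
selects the instrument's settings.

```python
def setting_from_switches(sw0: bool, sw1: bool, sw2: bool) -> int:
    """Combine three reed-switch states into a setting index from 0 to 7.

    sw0 is treated as the least significant bit (an assumed ordering).
    """
    return int(sw0) | (int(sw1) << 1) | (int(sw2) << 2)
```

Three switches give eight distinct combinations, so each pairing of
a top part and a bottom part can select one of up to eight settings,
such as a timbre family or a pattern set.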



Figure 2: Co-design and prototyping in Fab Lab U.de Chile 


In the first prototype, the instrument was sample-based, playing
back randomly from a collection of samples coming from the same
family of sounds. That was enough to test the interaction design and
to study how feasible it was for the performers to play the
instrument together. A simple score system was devised, similar to
the Gamelan Kepatihan notation, where the number would indicate the
face to be slapped. The first author was part of the Gamelan
orchestra in Cardiff, UK, for five years and was aware of the level
of difficulty of performing music with this type of notation. As
already mentioned, one of the main reasons why the Gamelan
philosophy was adopted for this project was the quick access that
beginner performers have to playing notated music within the context
of an orchestra. The score was briefly tested with non-musicians in
Santiago, and it was confirmed that the learning curve is very
gentle and that beginners could easily engage with that type of
orchestra. More information on the process can be found on the
authors' website 8 .
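A number-based score of this kind can be represented very simply.
The sketch below assumes a hypothetical text format that is not taken
from the project: tokens are separated by spaces, a digit names the
face to be slapped, and a dot marks a beat without a slap.

```python
def parse_score(line: str):
    """Parse a cipher-style score line into (beat, face) pairs.

    Digits name the face to slap; "." marks a beat without a slap
    (an assumed convention for this sketch).
    """
    events = []
    for beat, token in enumerate(line.split()):
        if token == ".":
            continue
        events.append((beat, int(token)))
    return events
```

For example, the line "1 3 . 2" describes slaps on face 1 and face 3
on the first two beats, a silent third beat, and a slap on face 2 on
the fourth beat.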

Fab Lab - Fab Lab Lima 

Fab Lab Lima 9 is a community Fab Lab; therefore, rather than
working with the Fab Lab team, we organised a workshop open to
members of the public with knowledge in either a design-related
field or electronics, programming or fabrication. We spent two days
with a multidisciplinary group of participants with diverse
backgrounds ranging from architecture to mathematics, biology, art,
electrical engineering and civil engineering, as well as members of
the community interested in the project. Each one chose to
contribute to one of the three areas of interest as designed by the
authors: instrument form and design, electronics and programming,
and 3D prototyping in collaboration with the design group. During
the time in Fab Lab Lima the authors repeated the last three stages:
ideate, prototype, test.

On the second day of the workshop we experimented with different
materials and processes, such as textiles and weaving,
parametrically designed forms and 3D printing. Moreover, the
electronics were further developed and a PCB was designed according
to the circuit developed in Santiago, Chile. More information on the
process can be found on the authors’ website 10 .

The rest of the time we worked in the Lab refining the interaction
design and programming it in Pure Data. Different sound synthesis
algorithms were programmed there and presented to the participants.
One interesting algorithm passed the audio signal coming directly
from the piezoelectric sensor to a bank of parallel band-pass
filters. The centre frequency and the Q factor of the filters were
mapped to the orientation of the body of the instrument, so that by
tilting it the performers could generate a variety of unexpected
sonic textures, such as rain drops.
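The filter bank described above was built in Pure Data; the following
Python sketch only illustrates the principle. It uses a standard
biquad band-pass design (the constant 0 dB peak gain form from the
widely used Audio EQ Cookbook), and the mapping from tilt to centre
frequency and Q, the number of filters and the sample rate are
hypothetical choices.

```python
import math

SAMPLE_RATE = 44_100  # assumed sample rate in Hz


class BandPass:
    """Biquad band-pass filter with 0 dB gain at the centre frequency."""

    def __init__(self, freq, q):
        w0 = 2 * math.pi * freq / SAMPLE_RATE
        alpha = math.sin(w0) / (2 * q)
        a0 = 1 + alpha
        # Normalised coefficients; the b1 coefficient is zero for this design.
        self.b0, self.b2 = alpha / a0, -alpha / a0
        self.a1, self.a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, x):
        y = (self.b0 * x + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y


def make_bank(tilt_x, tilt_y, n_filters=4):
    """Build a bank from two tilt values in [-1, 1] (hypothetical mapping)."""
    base = 200.0 * 2 ** (2 * (tilt_x + 1))  # base frequency, 200 Hz to 3200 Hz
    q = 2.0 + 14.0 * (tilt_y + 1) / 2       # Q factor, 2 to 16
    return [BandPass(base * (k + 1), q) for k in range(n_filters)]


def process_bank(bank, samples):
    """Run the samples through all filters in parallel and mix the outputs."""
    return [sum(f.process(x) for f in bank) / len(bank) for x in samples]
```

In use, a new bank (or new coefficients) would be computed whenever
the orientation changes, and the piezo signal would be fed through
process_bank block by block.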

Fab Lab - National University of Colombia Medellin 

Fab Lab UNAL 11 is in Medellin, in the Arts and Architecture School
of the National University of Colombia. During our week in the Fab
Lab, we worked with the Lab’s team to co-design a parametric 12
shape for the instrument and fabricate the result in wood.
Parametric design and CNC milling were this Lab’s strongest assets,
so we experimented with both.

8 https://www.stiwdioeverywhere.com/2018/04/20/making-in-fab-lab-u-de-chile/

9 https://www.fablabs.io/labs/fablablima 

10 https://stiwdioeverywhere.com/2018/05/09/making-in-fab-lab-lima/ 

11 https://www.fablabs.io/labs/fablabUNmedellin 

12 https://www.grasshopper3d.com/ 
Figure 3: Co-design and prototyping in Fab Lab Lima


Figure 4: Making in Medellin


The team in Colombia had a particular interest in the digital
fabrication aspect of the project, testing different types of wood
for the end result. Oak, eucalyptus and pine were available to use
at the Lab, and after testing the weight, the acoustical properties
and the milling bits to be used in each case, the team decided to
use pine for the two-part instrument, as illustrated in Figure 5. We
made three prototypes out of pine wood to test the size, ergonomics,
wood texture and acoustics. The authors decided to repeat stages 3,
4 and 5 of the methodology: ideate, prototype and test. Without a
major change in the ergonomics of the instrument, the final result
was slightly bigger than the size agreed in Fab Lab U. de Chile,
simply because the geometry generated by the algorithm was more
complicated. The bottom part enclosed the electronic circuits and
had six main faces that were used to produce different sounds
depending on the angle at which the performer decided to tilt it.
More information on the process can be found on the authors’
website 13 .

The circuit diagram and PCB layout for through-hole components
designed in Peru were given to the team for milling. Unfortunately,
due to software problems, the drivers of the milling machine were
not working and there was no alternative way of producing the board
with a process used by the Fab Lab community. The widely known
etching technique is not supported by the Fab Lab network, which
focuses on more computer-aided-manufacturing approaches.

For the sound generation part of the instrument, a different
approach, closer to algorithmic composition, was explored, and it
produced higher-level musical material. A number of short musical
phrases were composed or generated algorithmically, which could be
repeated and triggered interactively by the performers. Each face of
the polyhedron triggered a different phrase, randomly or in a
predefined order. Musical parameters of the phrase such as its tempo
and dynamics were mapped to the orientation of the body. The
performers could articulate the phrases, control how many times they
were repeated and when they would start playing. This procedure was
inspired by In C by Terry Riley.
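The phrase-triggering scheme can be sketched as follows. This is an
illustration under stated assumptions, not the authors' Pure Data
patch: the phrase pools, the tempo and velocity ranges and all
function names are hypothetical, and each note is simply given one
beat.

```python
import random

# Hypothetical phrase pools: lists of MIDI note numbers, one pool per face.
PHRASES = {
    0: [[60, 62, 64], [64, 62, 60]],
    1: [[67, 69, 67, 65]],
}


def pick_phrase(face, counter, rng=None):
    """Pick a phrase for a face, in predefined order or at random.

    With rng=None the pool is cycled in order using counter; passing a
    random.Random instance selects a phrase at random instead.
    """
    pool = PHRASES[face]
    if rng is not None:
        return rng.choice(pool)
    return pool[counter % len(pool)]


def tempo_from_tilt(tilt):
    """Map a tilt value in [-1, 1] to a tempo between 60 and 180 BPM."""
    return 60.0 + (tilt + 1.0) / 2.0 * 120.0


def velocity_from_tilt(tilt):
    """Map a tilt value in [-1, 1] to a MIDI velocity between 1 and 127."""
    return round(1 + (tilt + 1.0) / 2.0 * 126)


def schedule(phrase, repeats, tempo_tilt, dynamics_tilt):
    """Return (onset_seconds, note, velocity) events, one note per beat."""
    beat = 60.0 / tempo_from_tilt(tempo_tilt)
    velocity = velocity_from_tilt(dynamics_tilt)
    notes = phrase * repeats
    return [(i * beat, note, velocity) for i, note in enumerate(notes)]
```

Leaving the number of repeats and the start time to the performer, as
in this sketch, mirrors the way players of In C decide when to move
on to the next phrase.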


13 https://stiwdioeverywhere.com/2018/05/21/making-in-fab-lab-unal-medellin/



Figure 5: Two part CNC milled prototype in Medellin 


5. CONCLUSIONS 

The Game|Lan project was an interesting experiment, trying to match
the participatory approach in design and fabrication with the
culture of the digital nomads. The different teams managed to
develop one finalised instrument and, equally importantly, to share
knowledge, skills and ideas beyond their cultural barriers. The
authors were flexible and worked with each Lab in a different way,
respecting the diversity within the Fab Lab network. Unfortunately,
there was no time left to experiment musically or perform with the
instrument. Upon reflection, there are a few areas for improvement
and points to consider for others who decide to do a similar
project:

1. It was not an easy task to accomplish, especially while
travelling. The authors spent 8 days working in Fab Lab U. de
Chile and managed to go through all stages of the design. In
the other two locations they had to spend less time.

2. An ambitious project that would normally take a certain
amount of time in one’s local Fab Lab may take up to three
times longer in other places, especially when one is not
familiar with the local settings. This does not apply to smaller
projects or projects in collaboration with university students.

3. The Fab Labs’ website that shows the location, machines and
activity of each Fab Lab in the world needs an update: not all
places were active or had the equipment needed, and this cut
the project short.

Despite the points above, the authors managed to gather an important
body of knowledge related to the project, including a series of
alternative design ideas and fabrication methods. The important
points highlighted during the first days of the project in Fab Lab
U. de Chile set the rules, that is, the design values to be
followed. This part proved to be vital to the project, not only
during the first week in Chile but throughout the whole duration of
the project. The participants, whether in Colombia or Peru,
understood and respected the decisions that were taken collectively
by the first team in Chile. It was difficult for the participants to
make sure they would address all the points when co-designing and
prototyping the instruments in each place; however, they happily
accepted the challenge. There were always points where new decisions
were discussed and tested; this gave a sense of empowerment and
ownership in each place.

The overall challenge of co-creation, especially when not all
participants have collaborated before, may delay the final result.
However, each person’s knowledge, ideas and experiences added
significant value to the project. Co-creation in spaces like the Fab
Labs seems to come naturally to their members, and the authors are
optimistic that there will be more examples in the future.

This is work in progress; future work includes improved, longer
workshops where one instrument per location will be fabricated. All
designs and music scores are to be uploaded to a web-based hosting
service for version control such as GitHub so that they are
accessible to the community, and step-by-step instructions and
documentation of the fabrication are to be shared on the authors’
website. Moreover, a series of concerts is envisaged that could take
place remotely as network performances or at the International Fab
Lab conferences.
ABSTRACT

We present a novel robotic implementation of an embedded Linux
system in Shimi, a musical robot companion. We discuss the chal¬ 
lenges and benefits of this transition as well as a system and techni¬ 
cal overview. We also present a unique approach to robotic gesture 
generation and a new voice generation system designed for robot au¬ 
dio vocalization of any MIDI file. Our interactive system combines 
NLP, audio capture and processing, and emotion and contour analy¬ 
sis from human speech input. Shimi ultimately acts as an exploration 
into how a robot can use music as a driver for human engagement. 

1. INTRODUCTION 

The field of robotics depends on embedded hardware and software 
for real-time computational tasks such as kinematics, computer vi¬ 
sion, and sensor data processing. For many of these tasks, state-of- 
the-art performance depends on computationally heavy deep learn¬ 
ing techniques. Embedded computing devices have only recently 
been developed with the GPUs necessary to perform complex deep 
learning inference in real-time. One such device is the NVIDIA 
Jetson TX2, an embedded system-on-module that runs Linux on 
a quad-core ARM processor, and features an 8GB GPU built on 
NVIDIA’s Pascal architecture. This powerful and energy-efficient 
device greatly expands the capabilities of robots and other embed¬ 
ded applications alike through its ability to run both high CPU and 
GPU tasks, such as artificial neural networks, deep learning, and sig¬ 
nal processing. 

This project uses the Jetson TX2 to run a musical robot com¬ 
panion named Shimi (Figure 1). Shimi moves with five degrees of 
freedom, and can play audio out of two speakers on either side of its 
head. Additionally, Shimi features a 4-microphone array on its un¬ 
derside. Prior to being run by the Jetson TX2, Shimi was controlled 
with an Android smartphone and an Arduino Mega. 

The purpose of Shimi is to explore novel ways in which humans 
can communicate with artificial intelligence (AI) agents. Many mod¬ 
ern AIs attempt to replicate communicative patterns of humans as 
closely as possible, using state-of-the-art text-to-speech procedures 
and complex mechanical operation to try and convince users that 
they interact with a human-like device, not a computer or a robot. 
This can quickly lead to the "uncanny valley" psychological phe¬ 
nomenon, where the small differences between an AI and a real hu¬ 
man evoke a deeply unsettling feeling. In this project, the authors 
embrace the non-human robotic identity of Shimi and explore meth¬ 
ods of communication using Shimi’s limited range of motion and 
music, in place of verbal language. This is realized through a voice 
generation system that utilizes deep learning to respond to human 
speech in an emotionally relevant manner, and a gesture generation 
system that uses both quantified emotion and Shimi’s musical voice 
to craft robotic body language using Shimi’s five degrees of freedom. 



Figure 1: The musical robot companion Shimi. 


2. RELATED WORK 

Prior work on Shimi focused first on utilizing the sensors and
computational power of a smartphone to explore the possibilities of
personal robotics in a cost-effective way [1]. The research in this
study also provided inspiration for life-like gestures, taking cues
from animation. Other work on Shimi explored expressing emotion
through gesture, informed by observations of human movement and
emotion from Darwin [2, 3]. Others have used the Laban Effort System
in gesture generation, specifically in low degree of freedom robots
such as Shimi [4]. Additionally, speech analysis as input to gesture
generation has been used for robot communication in many cases, such
as Kismet [5].

Music as a vector for emotion has been demonstrated in numerous
studies, with comprehensive research exploring what emotions can be
perceived or induced through music, what musical features encode
emotion, and how music expresses or induces emotion [6]. Studies
have shown clear correlations between musical features and movement
features, suggesting that a single model can be used to express
emotion through both music and movement [7]. Additionally, humans
demonstrate patterns in movement that are induced by music [8].

3. TECHNICAL DESCRIPTION 

3.1. Voice System 

3.1.1. Input Analysis 

Shimi analyzes incoming audio streams using a combination of nat¬ 
ural language processing (NLP) and raw audio analysis. Shimi fea¬ 
tures a Seeed Studio ReSpeaker Mic Array v2.0 1 , a four-microphone 
array with on-board processing that combines each microphone 
stream and denoises the recording, emphasizing voice signals. No 
additional processing of input signals was added after the ReSpeaker 
processing, other than down-mixing to a single channel. Using the 
open-source hotword detection library Snowboy 2 , Shimi responds 
to the phrase "Hey Shimi," and begins recording input audio. The 
Python phrase detection library speech_recognition 3 is then 
used to capture one phrase of raw audio. 

Incoming audio is analyzed using the valence arousal model, 
whereby valence is the measure of the positivity or negativity of an 
emotion, and arousal is the measure of the energy of an emotion [9].
Raw audio analysis is used to find the arousal level, pitch, intensity
and onsets. To do this we utilized Parselmouth 4 , a Python library
built on Praat 5 . We created custom metrics to analyze the input
level based on analysis of the Ryerson Audio-Visual Database of
Emotional Speech and Song (RAVDESS) data set [10]. RAVDESS
includes 7356 audio files by 24 actors, each rated with an emotion
independently validated by 10 participants. Our metrics were based
on pitch contours and intensity levels found in the recordings. Figures
2 and 3 show analysis of the phrase "the dogs are sitting by the door"
from the data set. Our metrics to measure arousal use the variety, 
level and standard deviation in intensity and the range, contour and 
standard deviation of pitch. 
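The summary statistics named above (level, range and standard deviation of a contour) can be sketched in a few lines. This is only an illustration under stated assumptions: the per-frame pitch and intensity values are hard-coded here, whereas in the system they would come from Parselmouth; the function name `contour_features` and the convention that a value of 0 marks a frame without an estimate are hypothetical, and the paper does not specify how the statistics are combined into an arousal score.

```python
import statistics


def contour_features(values):
    """Summarize one per-frame contour (pitch in Hz or intensity in dB).

    Frames with a value of 0 are treated as "no estimate" (an assumption
    made for this sketch) and are ignored.
    """
    usable = [v for v in values if v > 0]
    return {
        "mean": statistics.fmean(usable),         # overall level
        "range": max(usable) - min(usable),       # spread between extremes
        "std": statistics.pstdev(usable),         # variability around the mean
    }


# Made-up example contours; real ones would be extracted from the recording.
pitch_hz = [0.0, 210.0, 225.0, 240.0, 0.0, 198.0, 205.0]
intensity_db = [52.0, 61.0, 66.0, 70.0, 58.0, 63.0, 60.0]

pitch_stats = contour_features(pitch_hz)
intensity_stats = contour_features(intensity_db)
```

A wider pitch range and a larger intensity deviation would then point towards higher arousal, with the exact weighting left open.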

To measure valence we use the Natural Language Toolkit (NLTK) 
[11], a suite of Python modules for NLP. We calculate valence
using a built-in naive Bayes classifier trained on the NLTK data set of
tagged phrases from social media. We also use the NLTK library for 
statement classification. 

1 http://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/

2 https://snowboy.kitt.ai/

3 https://github.com/Uberi/speech_recognition

4 https://github.com/YannickJadoul/Parselmouth

5 http://www.fon.hum.uva.nl/praat/


3.1.2. Shimi’s Emotion

Shimi maintains its own emotional state through each communication,
tracked through a position in valence and arousal. Valence and
arousal are both measured between -1 and 1. The current model
gradually shifts the valence level towards that of the user while
mirroring the arousal of the user. A negative valence statement from
the user will cause Shimi to respond in a sad tone. Successive positive
statements from the user will gradually move Shimi towards positive
responses. When starting, Shimi begins with a valence of 0.5,
equating to slightly happy. 
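A minimal sketch of such a state update is given below, assuming a simple linear drift. The class name `EmotionState`, the drift `rate` of 0.25 and the starting arousal of 0 are hypothetical choices made for illustration; the text only specifies the value range, the starting valence of 0.5, the gradual valence shift and the mirrored arousal.

```python
from dataclasses import dataclass


def clamp(x, lo=-1.0, hi=1.0):
    """Keep a value inside the [-1, 1] range used for valence and arousal."""
    return max(lo, min(hi, x))


@dataclass
class EmotionState:
    valence: float = 0.5  # starting valence given in the text
    arousal: float = 0.0  # assumed; the text gives no starting arousal

    def update(self, user_valence, user_arousal, rate=0.25):
        """Drift valence towards the user's valence and mirror their arousal.

        `rate` is a hypothetical smoothing factor: 0 never moves,
        1 copies the user's valence immediately.
        """
        self.valence = clamp(self.valence + rate * (user_valence - self.valence))
        self.arousal = clamp(user_arousal)
        return self


state = EmotionState()
state.update(user_valence=-1.0, user_arousal=0.8)
```

With these numbers a single strongly negative statement lowers the valence from 0.5 to 0.125, so several negative statements in a row are needed before the state becomes clearly negative.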


3.1.3. MIDI Dataset and Phrase Generation 

To control Shimi’s vocalizations we generate MIDI phrases that then 
drive the synthesis and audio generation described below and lead 
the gesture generation. For this purpose we created our own data set 
of MIDI files tagged with a valence and arousal quadrant. We col¬ 
lected MIDI files from eleven improvisers around the United States. 
Each was told to record MIDI phrases between 100 ms and 6 seconds,
with each phrase assigned one of the quadrants from the
valence/arousal model. They also recorded phrases that they believed
represented a question, an answer to a question, a greeting and a
farewell. Improvisers were told to record between 50 and 200 samples
of each category. To restrict the data each phrase could only contain 
velocity values at the start of a note and no MIDI data outside pitch, 
velocity and rhythms were included in training (i.e. no expressive 
modulations). 

As the data set was created by many improvisers, we created a
second process to confirm the validity of the collected files. This
was done through a comparison of the pitch, velocity and contour
variation between the new MIDI data set and the RAVDESS data
set. Figure 3 and Figure 4 present an example of the variance
in the data set between different emotions (blue is pitch, orange is
intensity, placed over a spectrogram). Any MIDI file that varied too
far from the features of RAVDESS was removed from the data set.
Table 1 shows the final number of files used for Shimi’s phrase
generation. The RAVDESS data set does not include greetings, farewells,
questions or answers, and due to their limited use in Shimi’s
interaction we did not post-process these phrases.


Table 1: Shimi Emotional MIDI Data set

Phrase Type     MIDI Samples    Post Process
VA1 (Happy)     895             400
VA2 (Angry)     1042            621
VA3 (Sad)       980             567
VA4 (Calm)      700             385
Greetings       655             655
Farewell        895             895
Question        901             901
Answer          778             778


To generate phrases for Shimi’s vocalizations, we chose to use a
data-driven generative method. We also considered using the samples
recorded by improvisers directly; however, we wanted to aggregate
the features created by all improvisers and develop a system that
allowed limitless variability. Having chosen to use deep learning, we
implemented a relatively simple long short-term memory recurrent
neural network (LSTM RNN) in Keras over TensorFlow, as has
been previously presented [12][13]. This type of neural network is
useful for this task as it is sequential and considers parts of its input 
as it creates output, encouraging the creation of musical phrases. The 
data set was first transposed into all twelve keys, to avoid a need to 
identify a key center. Eight different versions of the network were 
trained, one for each tagged component of the data set. This was 
done with the goal of a faster run time. 
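The transposition step can be illustrated with a short sketch. It assumes each phrase is a list of (pitch, velocity, start, duration) tuples and that the twelve versions are produced with shifts of -6 to +5 semitones; both the representation and the shift range are assumptions for illustration, since the text only states that the data set was transposed into all twelve keys.

```python
def transpose_to_all_keys(phrase):
    """Return up to twelve transposed copies of a phrase.

    `phrase` is a list of (pitch, velocity, start_seconds, duration_seconds)
    tuples. Only the pitch changes; velocity and timing are kept as recorded.
    Copies that would leave the MIDI pitch range 0..127 are skipped.
    """
    versions = []
    for shift in range(-6, 6):  # twelve shifts, including 0 (the original key)
        shifted = [(p + shift, v, s, d) for (p, v, s, d) in phrase]
        if all(0 <= p <= 127 for p, _, _, _ in shifted):
            versions.append(shifted)
    return versions


# A made-up three-note phrase: C4, E4, G4.
phrase = [(60, 90, 0.0, 0.25), (64, 80, 0.25, 0.25), (67, 100, 0.5, 0.5)]
versions = transpose_to_all_keys(phrase)
```

Training on all twelve versions means the network sees every phrase at every pitch level, which is why no key center has to be identified.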
Figure 2: Shimi System Overview.


3.1.4. Audio Creation and Synthesis 

MIDI phrases are fed to a new synthesis system created for Shimi. To
generate vocalizations that focus on emotions devoid of all semantic
meaning, we chose to construct a new vocabulary. Shimi’s
vocabulary is built upon phonemes from the Australian Aboriginal
language Yuwaalaraay, a dialect of the Gamilaraay language. Initial
ideas explored real-time implementations of deep learning raw audio
synthesis; however, it quickly became apparent that this would add
an unacceptable amount of latency to the system. In our testing, even
with large compromises in bit rate, we were never able to achieve
better than a 1 to 5 ratio of sound to processing time (1 second of
audio took 5 seconds to process). Instead of real-time synthesis we
compromised by interpolating 28 language samples with four different
synthesizer sounds, manually created by the authors. For each sound
three different intensity levels were recorded at two different
octaves, giving a total of 672 wave samples (28 x 4 x 3 x 2), each
500 ms long. Our final interpolation was done using a modified version
of NSynth [14], trained on the NSynth data set. Sounds are played back
using a synthesis engine that time stretches and pitch shifts the wave
samples to match the incoming
MIDI file. 
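The size of the sample bank and the playback adjustment can be sketched as follows. The intensity labels, the octave indices and the function `playback_params` are hypothetical; the sketch only shows how the 672 combinations arise and how a 500 ms sample could be stretched and shifted to fit a MIDI note, using the standard equal-tempered ratio of 2^(semitones/12). It is not the synthesis engine used for Shimi.

```python
from itertools import product

LANGUAGE_SAMPLES = range(28)           # 28 language samples
SYNTH_SOUNDS = range(4)                # four synthesizer sounds
INTENSITIES = ("low", "mid", "high")   # three intensity levels (labels assumed)
OCTAVES = (0, 1)                       # two octaves (indices assumed)
SAMPLE_SECONDS = 0.5                   # each wave sample is 500 ms long

# One entry per recorded combination.
bank = list(product(LANGUAGE_SAMPLES, SYNTH_SOUNDS, INTENSITIES, OCTAVES))


def playback_params(note_pitch, note_seconds, sample_pitch):
    """Return (time-stretch factor, pitch ratio) for one MIDI note.

    A stretch factor of 2.0 doubles the sample length; a pitch ratio of 2.0
    raises the sample by one octave (12 semitones).
    """
    stretch = note_seconds / SAMPLE_SECONDS
    pitch_ratio = 2.0 ** ((note_pitch - sample_pitch) / 12.0)
    return stretch, pitch_ratio


# A one-second note an octave above a sample recorded at MIDI pitch 60.
params = playback_params(note_pitch=72, note_seconds=1.0, sample_pitch=60)
```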

3.2. Gesture System 

Much like in human communication, Shimi’s gestures are tightly 
coupled with speech [15]. The voice system produces three outputs: 
an audio file of Shimi’s speech, the MIDI musical representation of 
the audio, and quantitative measures of Shimi’s current emotion. The 
latter two outputs are the inputs to a rule-based generative gesture 
system, which controls synchronized playback of gesture with the 
generated audio. 

The first step in gesture generation is musical feature extraction
from the MIDI representation of Shimi’s speech. Using the Python
libraries pretty_midi 6 and music21 7 , musical features such
as tempo, range, note contour, key, and rhythmic density are
obtained. These features are used to create mappings between Shimi’s
voice and movement; for instance, pitch contour is used to govern
Shimi’s torso forward-and-backward movement. Other mappings include
beat synchronization across multiple subdivisions of the beat in
Shimi’s foot, and note onset-based movements in Shimi’s up-and-down
neck movement. These mappings are based on research investigating
correlative features in music and musically-induced movement
[8, 7, 16].

6 https://github.com/craffel/pretty-midi

7 https://github.com/cuthbertLab/music21

[Figure: plot panel "Happy"; horizontal axis: time [s].]

The next step uses the emotion state of Shimi to condition Shimi’s
movement. Emotion is provided to the system in the form of
continuous-valued valence and arousal. These values are then used to
condition the musical mappings formed previously. In general, arousal
is used to restrict or expand range of motion, and valence is used
to govern the amount of motion Shimi exhibits, though exact usage
varies for each degree of freedom.

In addition to musical and emotional mappings, some degrees
of freedom are interdependent. For example, as Shimi’s torso moves
forward, Shimi’s head naturally moves forward and toward the ground.
This affects where Shimi is looking, so it is important to consider
Shimi’s torso position when generating neck up-and-down movement.
To accommodate this, the movement paths of Shimi’s degrees of
freedom are generated sequentially and in full, before being
actuated together in synchronization with the audio of Shimi’s
speech. This is implemented using the built-in threading library
in Python, with each degree of freedom being associated with one
Python thread responsible for sending motor control commands across
the duration of the gesture.

The motors used in Shimi are Dynamixel MX-28 actuators produced
by Robotis. They feature built-in controllers, allowing for
closed-loop control through half-duplex UART serial communication.
While the MX-28 motors allow for both reading and writing
of position and speed, the half-duplex nature of their communication
introduces latency when reading and writing to multiple motors at
once, at a resolution high enough for smooth movement. To generate
rigorously timed gestures, we do not read from Shimi’s motors, and we
write to them as infrequently as possible. This minimizes any latency
inherent in the transmission of data to the motors. For smooth and
natural-looking movement, the velocity curve of a gesture is most
important. As such, the position of Shimi’s motors is only ever set
when the direction of movement changes, and velocity changes are set
as frequently as possible without accruing latency. Setting position
once and defining the velocity curve allows for control of both when
Shimi reaches a certain position, and how Shimi gets there.

Gestures, then, are defined as sequences of movements to a position
over a specified time. To facilitate programmatic gesture generation,
a collection of velocity curves has been implemented to provide
styles of movement. The simplest is a constant velocity, where
velocity is the distance of the movement over its duration (Figure 5).
This style looks the most stereotypically “robotic”, as the motors can
accelerate from rest to max velocity much faster than a human can.

Previous work on Shimi introduced a velocity curve that features
a constant acceleration until the midpoint of the gesture, then a
constant deceleration [1]. This works particularly well for single
movement or broad gestures, and looks the most realistic when compared
with human motion (Figure 5).

In the context of a multi-move gesture, however, accelerating
and decelerating every movement becomes unnatural, as multi-movement
human gestures do not come to rest between each move. Thus,
a constant acceleration (or deceleration) and constant velocity curve
can cap both ends of a gesture. An example of the acceleration
variety is shown in Figure 5.

[Plot: "Velocity Curves"; horizontal axis: Time [s].]

Figure 5: Graphs of the velocity curves used for Shimi movements. 
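The three styles of velocity curve can be sketched as plain functions of time. This is an illustration under stated assumptions, not the code running on Shimi: the function names, the units (degrees and seconds) and the `ramp` fraction of 0.25 are hypothetical. Each curve is scaled so that the area under it equals the distance to travel, which the helper `travelled` checks numerically.

```python
def constant_velocity(distance, duration, t):
    """Same speed for the whole movement: distance over duration."""
    return distance / duration


def accel_decel(distance, duration, t):
    """Constant acceleration to the midpoint, then constant deceleration.

    The peak is twice the average speed so the triangle's area is `distance`.
    """
    peak = 2.0 * distance / duration
    half = duration / 2.0
    return peak * t / half if t <= half else peak * (duration - t) / half


def accel_then_constant(distance, duration, t, ramp=0.25):
    """Accelerate for the first `ramp` fraction of the move, then cruise.

    Suitable for the first move of a multi-move gesture; mirroring it in
    time gives the decelerating variant for the last move.
    """
    ramp_time = ramp * duration
    cruise = distance / (duration - ramp_time / 2.0)
    return cruise * t / ramp_time if t < ramp_time else cruise


def travelled(curve, distance, duration, steps=10000):
    """Integrate a velocity curve with the midpoint rule."""
    dt = duration / steps
    return sum(curve(distance, duration, (i + 0.5) * dt) * dt for i in range(steps))


# A 90 degree move over 1.5 seconds with each style.
totals = [travelled(c, 90.0, 1.5)
          for c in (constant_velocity, accel_decel, accel_then_constant)]
```

Because only the velocity is updated during a move, a curve like these can be sampled at whatever rate the serial link tolerates.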

In addition to the movement sequencing method of gesture gen¬ 
eration, a different method of recording and playing back gestures 
is being explored. This method requires physically moving Shimi’s 
limbs in a desired gesture while the motors continuously record
position and speed as fast as possible. After recording, the captured
positions and speeds can be used to actuate the gesture on Shimi on 
demand, resulting in a highly detailed and smooth gesture. While 
this method results in the most nuanced and expressive gestures, 
there are difficulties in playing back recorded gestures accurately 
in time with the way they were recorded. The time taken to read 
a motor’s position and speed varies, resulting in playback that is not 
aligned with the recording. This timing behavior makes synchro¬ 
nization with speech, which is a necessity for Shimi, very difficult. 
Ways to align these types of gestures with audio are still
being investigated.

4. APPLICATIONS AND FUTURE WORK 

This work has described Shimi’s ability to generate musical and
gestural responses to human speech input that attempt to replicate the
emotion conveyed in a spoken phrase. These short-form interactions
provide insight into how robots can express emotion and communi¬ 
cate with music. A next step in communication will be seeing how 
accurately Shimi can imitate a phrase, both vocally and, more im¬ 
portantly, emotionally. We are also interested in expanding Shimi’s 
musical phrases to include more languages and improvisers of dif¬ 
ferent origins. 

Shimi originated as a musically-intelligent speaker dock, and the 
work presented here can extend to more musical applications as well. 
One possibility is as a nuanced music recommendation system. In 
this system a human would ask Shimi whether it likes a song, and
Shimi would reply with a vocalization and gesture demonstrating an 
opinion of that song. This way of expressing opinion can be much 
more detailed than the thumbs up/thumbs down of many music ser¬ 
vice providers today. Another engaging musical experience furthers 
a previous goal of the Shimi project: to enjoy one’s music alongside 
a human listener. Now that Shimi has a voice, the ability to dance 
along with one's music can incorporate singing along as well. This 
could also lead to Shimi as a robotic performer, listening to human 
performers and improvising alongside as a vocalist. 
ABSTRACT

We present a haptic floor composed of tiles with independently
controllable vertices and designed to cover arbitrarily large flat
surfaces. We describe the signal distribution architecture, based on
SATIE, our spatialization engine, and SWITCHER, our low-latency
multichannel streaming engine. The paper also describes several
approaches to content authoring when such a floor is deployed in an
immersive space. These approaches emphasize the correlation among
immersion modalities, such as continuous localization of sound from
the speaker system to the floor and continuous physical effect from
the video projection to the floor.

1. HAPTIC FLOOR FOR LARGE IMMERSIVE SPACES 

We investigate and experiment with the extension of a large immersive
space with a haptic floor covering the entirety of the surface. Largely
motivated by the augmentation of artistic venues that host an audience,
we are interested in those with i) the ability to offer immersive
listening for a group of people, ii) visual immersion, and iii) floor
space allowing for a case-by-case configuration of audience position
(sitting, standing, lying down) along with configuration of the
performance space.

Not limited to the above characteristics, the haptic floor proto¬ 
type we propose has a scalable design and a flexible authoring possi¬ 
bility, targeting tight relation with the audiovisual effect in the venue. 
Indeed, as demonstrated by the scientific literature, augmentation of 
immersion with haptics could increase the perception of virtual en¬ 
vironments by an audience, more specifically when combined with 
immersive sounds and visuals. Vection 1 , for instance, when
stimulated by actuators placed between the ground and the feet of
a sitting subject, is obtained in a shorter time and with more
intensity when the haptic feedback (constant-frequency sinusoidal
vibrations) is applied [1]. Similar results are obtained for a standing
posture, without any impact on the impression of presence [2]. Haptic
feedback can contribute to increased perceptive sensibility of an
individual [3, 4]. The above-cited research, however, concerns
experiences involving individual users. From this point of view,
developing a floor simulating haptic feedback of walking on particular
ground textures [5], such as snow, offers an opportunity for a
collective sensory experience.

Unfortunately the above cited research does not apply directly 
to our work. The devices employed in the previous research con¬ 
sider pre-determined posture of the subject, particularly in the case 
of walking. Other approaches and applications remain to be explored 
that offer creators an immersive and flexible space where different 
experiences can be quickly prototyped. 

* Society for Art and Technology [SAT] is an artist center in Montreal spe¬ 
cializing in dissemination of art made with new technologies with the focus 
on immersive arts and experiences. 

1 Illusion of self-motion 


As our primary source of inspiration, our immersive space called
the Satosphere (see figure 1) is a large dome-shaped audiovisual
projection space offering the view of the horizon (floor to ceiling
projection) and 360° at the same time. Over 11 meters high and 18
meters in diameter, the Satosphere is equipped with 157 loudspeakers
grouped into 31 adjacent clusters on the dome’s surface, and with 8
video projectors that distribute the video image across the dome’s
surface. Other venues around the world provide a listening environment
for spatial audio, where our research could apply. The CUBE [6] at
Virginia Tech is an immersive space dedicated to sound. It offers
a significant spatial resolution with its 124 audio channels and
provides for audio spatialization techniques, including movement
capture. The ALLOSPHERE [7] offers 360° vertical and horizontal
immersion. The lack of a floor is compensated by a bridge, which
allows visitors to go to the center and experiment with different data
visualization strategies and audio-visual immersive compositions.
Unfortunately, this constrains the viewer to a standing position and
to moving only along the narrow bridge. Another example, in France,
the ESPACE DE PROJECTION at IRCAM provides rotating panels to offer
several acoustic profiles [8]. It hosts 75 speakers arranged in a cube.
Finally, the team at the Center for Computer Research in Music and
Acoustics at Stanford University has developed the GRAIL [9], a system
of 32 speakers and 8 subwoofers that can be deployed in different
locations, such as outdoors, concert halls or studios. Other venues
are equipped with speaker setups that accommodate listening to
spatial music; a non-exhaustive list is provided in [10]: the IEM-Cube
and MUMUTH in Graz, the MULIT in Bergen, the MOTION Lab
in Oslo, the SPACE in Pesaro and the DIGITAL MEDIA CENTER
THEATRE in Baton Rouge.

Our work on the haptic floor is, to our knowledge, unique in
treating the question of creation and reproduction of content for
haptic feedback in the context of immersive space for groups of
participants in non-specific postures (see figure 1). The device
is designed to cover arbitrarily large surfaces: its hexagonal shape
(see figure 2) allows for easy assembly and fitting into large, flat
surfaces, such as the Satosphere’s floor.

In this paper, we present the prototype of a floor built to provide
haptic feedback during experiences with posture-agnostic
content created for large immersive spaces (Figure 1). Our prototype
represents one segment of a device that could cover arbitrarily large
floor surfaces and is driven by audio signals delivered via a local
network. It is integrated into our spatialization software, SATIE [11],
and is therefore tightly coupled with the immersive content. This is
illustrated by our demonstration video showing our floor integrated
in an immersive space 2 .


2 See our demonstration video (accessed Dec. 2018): 

https://vimeo.com/290925507 
(a) The audience can freely move around the space and interact with
telepresent space or simply sit around the center (Miscible by Maotik
and Manuel Chantre).

(b) The audience is lying down and fills most of the floor space
(Plateaux by Vincent Brault, Owen Kirby and Vincent Martin).

Figure 1: Examples of scenographies in our immersive space, illustrating the need for a haptic floor to be posture-agnostic, group-friendly and
correlated with audiovisual displays.


2. OUR HAPTIC FLOOR PROTOTYPE 

As seen in Figure 2, the hardware prototype is only one portion of
the projected floor, consisting of a single hexagon divided into six
triangles forming a mesh. An actuator is placed at each of the 7
vertices and is controlled individually with an audio signal ranging
between 0 and 100 Hz for a height amplitude of 38.1 mm. The shape has
been designed to scale up easily to the surface of a larger space by
multiplying the hexagonal components.

The signal distribution pipeline (Figure 3) consists of an audio
renderer (a computer with Ubuntu Linux 18.04) equipped with an
appropriate audio I/O and a set of Raspberry Pis running Raspbian.
Each Raspberry Pi is required to control three actuators (the 250i
model from our partner D-Box 3 ). The audio renderer is based on a
SuperCollider script combining specific signal processing and our
spatialization engine SATIE, described in more detail in section 3.1.

The audio renderer runs SATIE 4 , our spatialization engine that
provides for use of multiple spatializers in parallel [12]. In this
case, one rendering is performed for the entire haptic floor, along
with the existing 8-speaker audio display (or the dome). The coupling
of the haptic floor and speaker system allows for keeping some
coherence in experience design, thanks to an internal per-rendering
handling of the same OSC [13] message.

The audio renderer can handle a variety of inputs. Direct audio
signals correspond to sound objects in SATIE that can be spatialized.
OSC messages are interpreted in two possible ways: one uses
the SATIE protocol, which allows for sound object control (location,
spread, etc.), and the other controls the position of each actuator
independently.

Audio spatialization is handled via an audio interface wired to
the speaker setup. The haptic floor is handled via a LAN connecting
Raspberry Pi devices, each talking to a custom USB audio interface
which controls up to 3 actuators (where one actuator affects one
vertex of the floor’s “mesh”). The latter allows for a flexible
increase in the number of floor subparts by simply adding more
Raspberry Pis to the network.

3 D-Box is a company which designs, manufactures, and markets
actuators intended mainly for the entertainment and industrial
simulation markets, https://www.d-box.com/en, accessed Dec. 2018

4 https://gitlab.com/sat-metalab/satie, accessed Dec. 2018

On the software side, SATIE handles all input cases (audio and 
control signals) and performs spatialization for both the physical 
speaker system and the haptic floor. The audio signals destined for 
the traditional speakers are handled directly with the audio interface. 
The audio signals that control the haptic floor need to be sent over 
network to the Raspberry Pis. Low-latency streaming from the audio
renderer to the Raspberry Pis is achieved using SWITCHER 5 , our
multichannel and low-latency streaming engine. The transmission
of audio streams from SATIE to SWITCHER is done through the JACK
server. 
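The scaling rule of three actuators per Raspberry Pi can be made concrete with a short sketch. The function name `assign_actuators` and the zero-based actuator numbering are hypothetical; the grouping itself follows the description above, where each Raspberry Pi drives an audio interface controlling up to three actuators.

```python
ACTUATORS_PER_PI = 3  # each Raspberry Pi drives up to three actuators


def assign_actuators(n_actuators, per_pi=ACTUATORS_PER_PI):
    """Split actuator indices into consecutive groups, one group per Pi."""
    return [list(range(start, min(start + per_pi, n_actuators)))
            for start in range(0, n_actuators, per_pi)]


# The single-hexagon prototype has 7 actuators.
prototype_groups = assign_actuators(7)

# A larger floor, using a round figure of 200 actuators as an example.
large_floor_groups = assign_actuators(200)
```

For the prototype this yields three groups, the last one holding a single actuator, so three Raspberry Pis are needed; 200 actuators would need 67.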

3. AUTHORING FOR HAPTIC FLOOR 

The challenge lies in designing haptic content that is appropriate to 
the type of immersive experience, and more particularly when sound 
and graphics are involved. Here follow some use cases where haptic 
floor control can be correlated with immersive content: 

• locate the sound in the floor in order to continue a sound tra¬ 
jectory 

• propagate waves from a sea displayed on screen to mechani¬ 
cal waves on the floor 

• ripple effect as well as delivering of different types of haptic 
content to different areas of the floor at the same time, corre¬ 
sponding to drops falling from the sky 

• control vibration of the floor according to the sound played 

3.1. Audio spatializer based rendering 

SATIE [11, 14], written in the SuperCollider language [15], provides
rendering of virtual audio scenes spatialized over many audio
channels. We were able to build upon our previous experience with
near-field/far-field audio rendering [12] and tackle the floor as
another audio display because the provided actuators transform digital
audio signals into mechanical movement. This approach provides a few
benefits. First of all, the synchronization of audio signals, after
compensation for the delay between audio displays, is handled by SATIE
and does not require any other work. Secondly, haptic content creation
can be approached in parallel with audio creation and spatialization
design.

5 https://gitlab.com/sat-metalab/switcher, accessed Dec. 2018

(a) Concept drawing of one subpart of a larger haptic floor. 7
actuators (two of them, the black cylinders, can be seen on the left)
control each vertex independently.

(b) Our haptic floor prototype with a cube-shaped 8-speaker system. We
use the SATIE dual rendering feature in order to provide continuous
spatialization among the speaker system and the haptic floor.

Figure 2: Hardware design of a floor subpart. The current prototype is only one section of the target haptic floor.

Figure 3: Distribution pipeline. Our prototype uses three Raspberry
Pis and three D-Box interfaces, each one controlling three actuators.

We have experimented with different approaches to spatial au¬ 
dio such as VBAP, ambisonics and a crude equal power panning. All 
types of audio spatialization work well and the choice of approach 
will depend on the desired effect and audio content. We also applied 
an envelope tracker filtering in order to convert audio signals into sig¬ 
nals compatible with our actuators that respond well to frequencies 
between 0 and 100 Hz. 
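As an illustration of the simplest of these techniques, equal-power panning between two adjacent outputs (two speakers, or two actuators of the floor) can be sketched as below. This is the textbook sine/cosine law, written in Python for readability; it is not SATIE’s SuperCollider implementation, and the function name `equal_power_gains` is hypothetical.

```python
import math


def equal_power_gains(pan):
    """Return (gain_a, gain_b) for a pan position between two outputs.

    pan = 0.0 sends everything to output A, pan = 1.0 everything to
    output B. The squared gains always sum to 1, so the total power stays
    constant while the source moves.
    """
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


# A source halfway between the two outputs.
gain_a, gain_b = equal_power_gains(0.5)
```

At the halfway point both gains are about 0.707 rather than 0.5, which avoids the drop in perceived level that a linear crossfade would produce.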

Moreover, since SATIE can handle many types of control inputs 
(audio, USB, network), it can still assist in delivering synchronized 
audible and haptic audio signals. Additionally we can take advantage 
of SuperCollider’s powerful synthesis and DSP capacity to experiment
and design audio signals suitable for the haptic floor with a lot
of flexibility. Finally, the spatialization of audible audio signals and 
audio delivered to the haptic floor can be completely independent, 
which also offers the necessary creative freedom. 

As one of our first experiments, in collaboration with D. Andrew
Stewart 6 , we explored the use of the floor prototype driven by
8 discrete channels of analog audio from an electronic performance
instrument based on the Omnisphere VST plugin, controlled by a Karlax
controller. The audio channels were spatialized as 8 independent
sound objects on the octophonic, cube-shaped speaker layout, each
acting on the floor depending on the spatialization parameters. We
have experimented with OSC messages sent from the Karlax directly
to SATIE as well as via our interactive creation tool for immersive
spaces, ElS [16]. Through this short experimentation with live
performance, we found that using the sound of the instrument to drive
the movement and the texture of the floor creates a deep sense of
coherence. In fact, placing performers on the floor can have its
benefits.

6 http://dandrewstewart.ca/, accessed Dec. 2018

(a) Live control from Ableton (concept and development from Mourad
Bennacer). The orientation from the device (left) is applied to the
floor (right).

(b) Mapping with a 3D Mesh: the blue mesh (upper left) is a
virtual representation of the haptic floor and part of a larger
mesh. Here the hand detection with Leap motion (bottom)
allows for touching the floor (Concept and development from
Sebastien Gravel and Vincent Brault).

Figure 4: Authoring content for the haptic floor with other software (Ableton and Touch Designer) using an OSC protocol allowing real-time
control of each actuator independently.

3.2. Other approaches 

The other approaches we describe here are based on the control of
each actuator independently through Open Sound Control [17].
Messages from external software are composed of several float values
ranging from -1 to 1, each one being the desired height of the corre¬ 
sponding actuator. Along with our audio spatializer based rendering, 
they are illustrated in our demonstration video 2 . 
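A message of this kind can be sketched with the Python standard library alone, following the OSC 1.0 binary layout (null-terminated strings padded to four bytes, then big-endian 32-bit floats). The address `/floor/heights` and the function names are hypothetical, since the actual address used by the floor is not given here; only the payload, one float in [-1, 1] per actuator, comes from the description above.

```python
import struct


def osc_string(text):
    """Encode an OSC string: ASCII, null-terminated, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_float_message(address, values):
    """Build one OSC message whose arguments are all 32-bit floats."""
    type_tags = "," + "f" * len(values)
    payload = struct.pack(">%df" % len(values), *values)  # big-endian floats
    return osc_string(address) + osc_string(type_tags) + payload


# Desired normalized heights for the 7 actuators of one hexagon.
heights = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, -0.25]
packet = osc_float_message("/floor/heights", heights)  # address is hypothetical
```

The resulting bytes can be sent as a single UDP datagram, for example with `socket.sendto`, to whichever host and port the receiving software listens on.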

The first one, illustrated in Figure 4a, is a live control from Able¬ 
ton. The basic protocol has been implemented in Ableton where 
Ableton specific creations can be used in order to create a tight rela¬ 
tion between the sound and/or an Ableton controller with the haptic 
floor. This makes it possible, for instance, to program automated floor
vibrations synchronized with the audio track. In our demonstration
video, a simple mapping from the orientation of an
accelerometer-equipped device to the orientation of the haptic floor
has been implemented.

The second approach is a mapping of the floor to a 3D mesh
(Figure 4b) in 3D software. Accordingly, any physics or interaction
applied in the virtual environment becomes a source of vibration
possibly applied to the floor. This provides the potential for strong 
correlation of the visual with the floor. With this approach for in¬ 
stance, sea waves from a simulation can be displayed from the screen 
with a consistent continuity in the floor. In our demonstration video, 
a hand tracking system, the Leap Motion, is used in order to move a 
virtual hill along a planar mesh. 
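The mesh mapping can be sketched by sampling a virtual surface at the positions of the floor vertices. The sketch assumes the seven vertices of one hexagon are its center plus six corners, and uses a travelling sine wave as a stand-in for a simulated sea; the function names, the unit radius, the wavelength and the speed are all hypothetical values chosen for illustration.

```python
import math


def hexagon_vertices(radius=1.0):
    """Return the centre of a hexagon followed by its six corners (x, y)."""
    corners = [(radius * math.cos(k * math.pi / 3.0),
                radius * math.sin(k * math.pi / 3.0)) for k in range(6)]
    return [(0.0, 0.0)] + corners


def wave_height(x, y, t, wavelength=4.0, speed=1.0):
    """Height in [-1, 1] of a sine wave travelling along the x axis."""
    return math.sin(2.0 * math.pi * (x - speed * t) / wavelength)


def floor_heights(t):
    """Sample the virtual surface at every actuator position at time t."""
    return [wave_height(x, y, t) for x, y in hexagon_vertices()]


heights_at_start = floor_heights(0.0)
heights_later = floor_heights(0.5)
```

Each list holds one normalized height per actuator, which is the form expected by the per-actuator OSC control described in this section.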

4. CONCLUSION & NEXT STEPS 

This paper has presented our experience with distribution architec¬ 
ture of a scalable haptic floor targeting posture-agnostic multi-person 
immersive spaces. The floor is scalable in space thanks to its tri¬ 
angular shape allowing for unlimited tiling. Its signal distribution 


scalability is ensured using low latency multichannel streaming to 
Raspberry Pis, each one dedicated to groups of three actuators. 

Surprisingly, experiments with our prototype have pointed us to¬ 
wards an uncharted territory of haptic feedback, both from techno¬ 
logical and creative points of view. This led us to describe in this 
paper a set of methods for authoring immersive content that involves
the floor: i) using the floor as an additional “audio display”
driven by audio and/or by multi-speaker spatialization algorithms,
and ii) producing content from other software, including 3D graphics
engines, with the help of a basic OSC protocol providing independent
control of each actuator height.

Our next steps will target experiments with a larger scale floor
covering our dome with approximately 200 actuators. Along with
physical design and construction methods, we will go forward with
improving authoring methods for groups of users and validating the
scalability of the architecture.

ABSTRACT

The paper introduces midizap, a new Linux utility to interface MIDI
controllers with multimedia applications such as audio and video
editors or computer music programs. midizap is a heavily modified
version of Eric Messick’s ShuttlePRO program. Its purpose is to
translate MIDI controller input to commands (either MIDI or X11
keyboard and mouse events) which the application understands.
Configurations are simple text files; no programming skills are
required. There’s also an Emacs mode to help with creating and
testing these configurations. Jack session and MIDI patchbay
functionality is available as well, making it easy to manage separate
midizap instances for different controllers and applications.

1. INTRODUCTION 

These days, MIDI controllers are typically USB class devices which 
can be connected to a Linux computer without requiring any spe¬ 
cial hardware or drivers. Also, they’re often much cheaper than spe¬ 
cialized gear for specific uses such as photo and video editing. So 
wouldn't it be nice if we could just use whatever MIDI controller 
we have for controlling our favorite multimedia applications? The 
problem is, while DAW and DJ programs typically have extensive 
and customizable MIDI interfaces built into them, other applications 
may not offer any MIDI support at all, or only recognize a particular 
set of MIDI messages. Thus we often have to translate the MIDI in¬ 
put from the controller to whatever keyboard or MIDI commands the 
application understands, and we’d like to be able to do this without 
having to modify the target application. 

I was surprised to find that on Linux apparently there’s no sim¬ 
ple and practical solution for this problem yet. There is the Ctlra and 
Mappa software from the OpenAV project [1], but it is still under 
development and only readily supports a handful of devices and ap¬ 
plications right now, which means that adding a new controller or ap¬ 
plication likely requires a fair amount of C programming. A popular 
commercial program in this realm is the Bome MIDI Translator [2],
but it’s only available for Mac and Windows. 

Another interesting utility is Eric Messick’s ShuttlePRO pro¬ 
gram [3] which targets the Contour Design “Shuttle” devices [4] 
designed for video editing. These devices don’t speak MIDI, but 
Messick’s program is free (GPL) software, works on Linux, and in¬ 
cludes the necessary code to recognize applications by their window 
name and translate device input to X11 keyboard and mouse events.
Adding Jack MIDI support to it seemed to be a piece of cake, so 
that’s what I set out to do. The first result of this side project was 
a fork of the ShuttlePRO program which improves the original pro¬ 
gram in some ways and adds Jack MIDI output [5]. The next obvious 
step then was to replace the Shuttle input with Jack MIDI input,
giving birth to the midizap program as it stands now [6].

In the following sections, we discuss midizap’s most important
features and some typical uses. For lack of space, this description
is necessarily somewhat terse and incomplete, but should give the
interested reader an idea of what capabilities the program offers and
when you might want to use it. More details can be found on the
GitHub project page or in midizap’s extensive manual.

2. TRANSLATION SYNTAX 

As with the ShuttlePRO program, midizap’s configuration is a sim¬ 
ple text file which is divided into sections for different applications. 
A sample configuration is provided in /etc/midizaprc; you can copy
this to create a .midizaprc file in your home directory and edit it
there as needed. You can also run midizap with any other configu¬ 
ration file by specifying the name of the file on the command line. 
A collection of configurations for various purposes (mostly Mackie 
emulations for different devices) can be found in the examples folder 
in the sources. 

The configuration language is line-oriented: each line is either a
section header or a translation rule. The hash sign # at the beginning 
of a line or after whitespace starts a comment. Each section starts 
with a header of the following form, specifying a section name and 
a regular expression pattern: 

[name] pattern 

The section name is only used in diagnostic messages and can 
essentially be chosen freely. It is the regular expression pattern which 
actually determines whether the translations in the section are active 
at any given time. To this end, midizap matches the pattern against
the WM_CLASS and WM_NAME properties of the currently selected X
application window. The latter is what is actually visible in the
window title, while the former is an internal property which
identifies the type of application window. 1 The regular expression
pattern can also be
omitted, in which case the translations will always be active. Such 
“default” sections are to be placed near the end of the file, and their 
translation rules will be used as fallback translations when none of 
the other translation sections in the configuration match the selected 
application window. 
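
The section selection described above can be illustrated with a short Python sketch. This is not midizap's actual code: the `SECTIONS` list, the `select_section` function and the first-match-in-file-order policy are assumptions made for illustration, and Python's `re` module stands in for whatever regular expression dialect midizap itself uses.

```python
import re

# Hypothetical in-memory form of a configuration: (name, mode, pattern)
# triples in file order. mode is "CLASS", "TITLE", or None (try both
# properties); a pattern of None marks a default section.
SECTIONS = [
    ("Terminal", "CLASS", r"^(.*-terminal.*|konsole|xterm)$"),
    ("Shotcut", "CLASS", r"^shotcut$"),
    ("Default", None, None),
]

def select_section(wm_class, wm_name, sections=SECTIONS):
    """Return the name of the first section matching the focused window."""
    for name, mode, pattern in sections:
        if pattern is None:
            return name  # default section: always active
        if mode == "CLASS":
            candidates = [wm_class]
        elif mode == "TITLE":
            candidates = [wm_name]
        else:
            candidates = [wm_class, wm_name]
        if any(re.search(pattern, c) for c in candidates):
            return name
    return None  # no section applies
```

Because the default section is listed last, it is only reached when none of the application-specific patterns match, which is the fallback behaviour described in the text.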

The section header is followed by a list of (zero or more) trans¬ 
lation rules describing the translations which should be active for 
the given application. These just list MIDI messages and their trans¬ 
lations in a human-readable symbolic format. Each translation rule 
must be on a line by itself and consists of a single left-hand side 
symbol denoting the MIDI message to be translated, followed by the 
right-hand side which is a list of zero or more symbols specifying the 
MIDI messages and/or keyboard and mouse commands to be output. 
It thus takes the following general form: 

input output1 output2 ...

1 You can find out about the WM_CLASS and WM_NAME properties of a
window with the xprop program, or invoke midizap with the -dr
debugging option to have it print this information. midizap will try
to match both by default, but you can tell it explicitly to only
match class or title by prefixing the pattern with the CLASS or TITLE
token, respectively.

Here is a simple example:

[Terminal] CLASS ^(.*-terminal.*|konsole|xterm)$
F5 XK_Up
F#5 "pwd"
G5 XK_Down
G#5 "ls"
A5 XK_Return

This defines a list of translations for some common types of ter¬ 
minal windows, as specified in the section header on the first line. 
The input messages are listed on the left and the corresponding key¬ 
board output on the right. Here we map a few notes in the middle 
octave to the cursor up and down and return keys, as well as some 
frequently used shell commands. The bindings above will let you 
operate the shell from your MIDI keyboard when the keyboard focus
is on a terminal window. To make this work, you’ll first have
to connect your MIDI controller to midizap’s MIDI input port, e.g., 
using a Jack MIDI patchbay program like QjackCtl. You then click 
on the desired terminal window and start entering notes on the MIDI 
keyboard to have the corresponding commands sent to the selected 
window. 

It is important to note here that, like the ShuttlePRO program, 
midizap will only ever send keyboard and mouse commands to the 
currently selected window or, more precisely, the window which has 
the keyboard focus. The selected window also determines which sec¬ 
tion of translation rules is currently active. Thus you have to make 
sure that you first click on the right application window before you 
can go on sending keyboard and mouse commands to it. (In con¬ 
trast, MIDI commands can be sent to any application as long as it is 
connected to midizap’s MIDI output, see below.) 

Let’s now have a closer look at the syntax of translation rules. 
The precise syntax is a bit intricate, so we have to refer the reader 
to the EBNF grammar in Appendix A for details. But we will try 
to at least sketch out the most important elements in what follows. 
The first token of a translation rule (the left-hand side) denotes the 
MIDI message to be translated, which is followed by an output se¬ 
quence (the right-hand side) consisting of MIDI messages or X key 
and mouse events. There can be any number of these, and you can 
freely mix MIDI messages and X events on the output side. 

The XK symbols indicate X key codes and must be denoted exactly as
they appear in the /usr/include/X11/keysymdef.h file. A string
enclosed in double quotes is simply a shorthand for a sequence of X
key events. 2 Besides the key codes from the header file, there are
also some special tokens to denote mouse button and scroll wheel
events (XK_Button_1, XK_Scroll_Up, etc.).

MIDI note messages are denoted in a symbolic format that will 
be familiar to musicians: a note letter (A to G) is followed by an 
optional accidental (# or b) and an octave number. By default, C5 
denotes middle C, but the octave numbering can be changed with a 
directive in the configuration file. Other kinds of (non-system) MIDI 
messages are denoted using short mnemonics: KP:note (aftertouch
a.k.a. key pressure for the given note); CCn (control change for the 
given controller number); PCn (program change for the given pro¬ 
gram number); CP (channel pressure); and PB (pitch bend). These 
can all be followed by a dash and the MIDI channel (the default 
MIDI channel being 1). 
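
To make the note notation concrete, here is a minimal Python sketch which converts a note token to a MIDI note number and channel. It assumes the default numbering stated above (C5 is middle C, i.e., MIDI note 60) and the default channel 1; the function name `note_number` and its `middle_c_octave` parameter are hypothetical and merely stand in for the octave directive mentioned in the text.

```python
import re

# Semitone offsets of the natural notes within an octave.
PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_number(token, middle_c_octave=5):
    """Convert a token such as 'F#5' or 'C3-10' to (note number, channel)."""
    m = re.fullmatch(r"([A-G])([#b]?)(\d+)(?:-(\d+))?", token)
    if m is None:
        raise ValueError(f"not a note token: {token!r}")
    letter, accidental, octave, channel = m.groups()
    number = 60 + 12 * (int(octave) - middle_c_octave) + PITCH_CLASS[letter]
    if accidental == "#":
        number += 1
    elif accidental == "b":
        number -= 1
    # The channel suffix is optional; channel 1 is the default.
    return number, int(channel) if channel else 1
```

With these assumptions, `note_number("C5")` gives note 60 on channel 1, and `note_number("C3-10")` gives note 36 on channel 10.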

2 In the current implementation, this only works with printable ASCII
characters which can be mapped 1-1 to X11 key codes. Otherwise
explicit key codes must be used.

In the example above, all note messages are interpreted as key
events, having an “on” and “off” status: the key goes “down” when
a note-on message is received, and goes “up” again when the
corresponding note-off message (or a note-on with zero velocity)
arrives. We also call this a key translation. These work in the same
way as in the ShuttlePRO program; e.g., in the above example, the
XK_Up key is pressed when the note-on for F5 is received, and won’t
be released until the corresponding note-off is detected. If there’s
more than one key in the output sequence, as with the double-quoted
strings in the example, each key will normally be released before the
next one is pressed, and only the last key in the sequence will be
held until the note-off is received. There are also some special
suffixes for key specifications (/D, /U, /H) which indicate keys to
be held and released explicitly or at the end of the sequence; we
refer the reader to the documentation for details.
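
The press and release order of a key translation can be sketched in a few lines of Python. This only models the default behaviour described above (no /D, /U or /H suffixes); the function `key_events` and its event tuples are hypothetical and do not correspond to anything in midizap's sources.

```python
def key_events(keys, note_on):
    """Return the (key, action) events for one half of a key translation.

    keys is the output sequence of the rule; note_on is True for the
    note-on message and False for the matching note-off.
    """
    if not keys:
        return []
    if not note_on:
        # Only the last key is still held when the note-off arrives.
        return [(keys[-1], "release")]
    events = []
    for key in keys[:-1]:
        # Every key but the last is released before the next is pressed.
        events += [(key, "press"), (key, "release")]
    events.append((keys[-1], "press"))  # stays down until the note-off
    return events
```

For the rule `F#5 "pwd"`, `key_events(list("pwd"), True)` presses and releases p and w and then presses d, which is released only by `key_events(list("pwd"), False)`.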

As another, more practical example, here are some bindings for 
the Kdenlive and Shotcut video editors mapping some keys and the 
big jog wheel on a Mackie-compatible device to some common video 
editing functions: 

[Kdenlive/Shotcut] CLASS ^(shotcut|kdenlive)$

# playback controls
A#7 XK_space # Play/Pause
A7 "K" # Stop
G7 "J" # Rewind
G#7 "L" # Forward

# replace/drop (sets in and out points)
D#7 "I" # Set In
E7 "O" # Set Out

# left/right cursor movement
D8 XK_Home # Beginning
D#8 XK_End # End

# the jog wheel moves left/right by single frames
CC60< XK_Left # Frame reverse
CC60> XK_Right # Frame forward

The last two rules for the jog wheel show an example of a data 
translation which translates incremental changes in the extra data 
byte of a message to corresponding X key presses. For ordinary 
(absolute) control changes these take the form CCn- and CCn+, where
n denotes the controller number, and the - or + flag the direction of 
the change. However, here we employed the special < and > suffixes 
which indicate a relative change in “sign-bit” encoding [7], which is 
commonly used with encoders (knobs or wheels which can be turned 
endlessly in either direction). In either case, the up or down output 
sequence is emitted for each unit change in the parameter. You can 
also scale these responses by adding suitable step sizes on the left- 
hand or right-hand side of the translation rules; again we refer the 
reader to the documentation for details. 
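
As an illustration of the relative encoding, the following Python sketch decodes a 7-bit value in sign-bit representation and expands it into the key presses a jog wheel rule would emit. It assumes the usual convention that bit 6 carries the sign and bits 0..5 the magnitude (so 1 means +1 and 65 means -1); the function names and default key names are hypothetical.

```python
def decode_sign_bit(value):
    """Decode a 7-bit relative value: bit 6 = sign, bits 0..5 = magnitude."""
    magnitude = value & 0x3F
    return -magnitude if value & 0x40 else magnitude

def jog_wheel_keys(value, down_key="XK_Left", up_key="XK_Right"):
    """Emit one key per unit change, as a data translation would."""
    delta = decode_sign_bit(value)
    key = up_key if delta > 0 else down_key
    return [key] * abs(delta)
```

Here `jog_wheel_keys(2)` yields two XK_Right presses and `jog_wheel_keys(66)` two XK_Left presses, mirroring the CC60> and CC60< rules above.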

The rules we’ve seen so far all translate MIDI to X key events.
midizap can also work as a MIDI mapper which translates MIDI input
to MIDI output. This is useful if the target application supports
MIDI, but needs the controller input to be remapped to MIDI commands
it understands. The following example lets you play a little drumkit
on a General MIDI (GM) synthesizer like Fluidsynth, by remapping some
of the white keys in the 4th octave to a few drum notes on MIDI
channel 10 (the GM drum channel). We also threw in a rule to remap
the modulation wheel (CC1) to the volume controller on MIDI channel
10 (CC7-10). 3

3 The notation CC1= being used here provides a shorthand for two data

[MIDI]
C4 C3-10
D4 C#3-10
E4 D3-10
F4 D#3-10
CC1= CC7-10

Note that we placed the MIDI translations into a special [MIDI] 
section here. This is a default section reserved for applications ac¬ 
cepting MIDI input. To make this work, you will have to invoke the 
midizap program with the -o option. This enables the [MIDI] sec¬ 
tion and equips midizap with an additional MIDI output port which 
can be connected to the target application (like Fluidsynth in this ex¬ 
ample). As long as your translations only output MIDI messages, 
you then don't have to worry about keyboard focus, as the applica¬ 
tion will receive all data from midizap through the MIDI connection 
(in fact the application does not need to have any X window at all in 
this case). 

The above example does a simple 1-1 mapping of MIDI events, 
but in general the output sequence may consist of as many MIDI 
messages of as many different types as needed, and you can also mix 
MIDI and X keyboard and mouse output if you want. An interest¬ 
ing use case for MIDI translations is Mackie emulation which we’ll 
discuss in Section 4. 

3. GETTING STARTED 

Before we explore some of midizap’s more advanced features, let 
us quickly go over the mundane technicalities of using midizap. 4 
midizap is a command line application, so you typically run it from 
the terminal. However, it is also possible to launch it from your Jack 
session manager (see Section 5 below) or from your desktop envi¬ 
ronment’s startup files once you’ve set up everything to your liking. 
In addition, for Emacs users there’s a midizap mode which makes it
very easy to edit and test your midizap configurations. It does syntax 
highlighting, auto-completion of keywords, and also lets you launch 
midizap in an Emacs buffer; please check the midizap-mode.el file 
in the sources for details. 

midizap uses Jack for its MIDI input and output, so you’ll need 
to be familiar with Jack. We recommend using a Jack front-end like 
QjackCtl which makes setting up Jack and doing MIDI connections 
much easier. You’ll also need an ALSA-Jack MIDI bridge in order 
to expose the ALSA sequencer ports as Jack MIDI ports, so that the 
MIDI inputs and outputs of your controller and other non-Jack MIDI 
applications can be connected to midizap. Jack’s built-in bridge will 
work for this purpose (in the QjackCtl setup, select seq as the MIDI 
driver), or you can use Nedko Arnaudov’s a2jmidid utility [8]. The
latter is easier to use with Jack2, and will work with Jack1 as well.

Running just midizap without any arguments launches midizap
with the default configuration and a single Jack MIDI input port
which you’ll have to connect to your MIDI controller. To utilize
MIDI output, run midizap -o; as already mentioned, this equips
midizap with an additional Jack MIDI output port to be connected to
the MIDI application you wish to control. You can also run midizap
with any other configuration file by simply specifying the name of
the file on the command line. There are a number of other options
and configuration file directives which let you set the Jack client
name, number of input and output ports and the desired MIDI
connections; see Section 5.

translation rules CC1- and CC1+ with the same right-hand side CC7-10.

4 We don’t discuss installation here, which is very easy and, besides the X
libraries, only needs very few dependencies which should be readily available
on all Linux distributions; details can be found in the README file.

Moreover, midizap offers a fair number of debugging options
which will be very helpful when you start developing your own con¬ 
figurations. A good set of options to start with is -drkm; r prints 
the class names and titles of selected windows which is useful to de¬ 
termine which regular expressions to use in the section headers; k 
prints out recognized translations so that you can check that midizap 
is actually picking the right translation rules for some given MIDI 
input; and m activates midizap’s built-in MIDI monitor which prints 
out recognizable MIDI input in the same syntax that’s used in the 
configuration file, which makes it easy to figure out which MIDI 
messages you may want to create translations for. 

The default configuration is really just an example, to help you 
get started. You can either edit that file or create your own config¬ 
uration. To start from a clean slate, create an empty file in a text 
editor, say myconfig.midizaprc, and invoke midizap on it. The file 
will be reloaded whenever you save it, so you can just keep on adding 
translation sections and rules and try them out immediately, without 
having to restart the program. If you’re an Emacs user, you will find 
midizap’s Emacs mode most convenient to do all this. 

Let's walk through a simple example to show how this works. 
We’ll use the Shotcut video editor (https://www.shotcut.org/) 
for illustration, so let’s assume that you’ve already launched Shotcut 
and loaded a video file in it. Next, make sure that Jack is running, 
create the myconfig.midizaprc file, run midizap -drkm
myconfig.midizaprc, and connect your controller to midizap’s MIDI input.
With the Shotcut window selected, wiggle one of the controls on 
your MIDI gear; I'll take the modulation wheel as an example. In 
midizap’s output you should now see something like: 

Loading configuration: myconfig.midizaprc 
[0] CC1-1 value = 40 

no translation found for Untitled - Shotcut 
(class shotcut) 

This tells you the class name (shotcut) of the application win¬ 
dow, as well as the name of the incoming MIDI message (CC1-1, 
which can also be abbreviated as CC1 in the configuration, as 1 is 
the default MIDI channel). Having identified the application and 
the MIDI message we’d like to translate, we can now edit our con¬ 
figuration in the myconfig.midizaprc file accordingly. Let’s add the 
following section header and translations, and save the file: 

[Shotcut] CLASS ^shotcut$
CC1- XK_Left
CC1+ XK_Right

midizap should automatically reload the file. Moving the modu¬ 
lation wheel again (with the Shotcut window still selected) will now 
change the playback position in Shotcut, while the translations we 
just added are printed by midizap. 

4. ADVANCED USES 

One particularly interesting use case for MIDI translations is the em¬ 
ulation of Mackie controllers. The Mackie control protocol (MCP) 
has become a de facto standard for DAW programs, because it allows 
the various track parameters to be mapped without requiring any 
manual setup. 5 Also, many Mackie-compatible devices offer feedback,
i.e., the ability to display current parameter values and other
kinds of status information using LEDs, motor faders, scribble strips
and the like, which makes them very convenient to use.

5 Although MCP is widely used, there doesn’t seem to be a publicly
accessible specification of the protocol anywhere. A partial description can be
found at http://www.jjlee.com/qlab/MackieControlMIDIMap.pdf.

Some MIDI controllers have a built-in MCP mode, but many 
don’t. Thus it is tempting to employ midizap to emulate this mode. 
Even if a device already offers MCP, it may be lacking some features; 
this is true especially for some of the cheaper and/or smaller devices 
like the Behringer X-Touch Mini. In such cases midizap may be used 
to beef up the device’s capabilities and/or modify its bindings so that 
they better suit your workflow. 

Emulating MCP usually requires remapping some or all of the 
MIDI messages of the device, on both input and output (if the de¬ 
vice offers some feedback capabilities). Especially the feedback part 
often poses some challenges. The purpose of this section is to dive 
into some of midizap’s more advanced features catering to these use 
cases, using MCP emulation as a running example. Of course, these 
features may also be helpful in other situations calling for compli¬ 
cated translations. 

4.1. Shift State 

One issue we often face right away when designing a Mackie emu¬ 
lation is the number of available controls. For instance, your device 
might only provide you with 8 faders which must then be used to 
emulate both the volume and the panning controls of a Mackie con¬ 
troller. Or it may not have enough buttons for all the special MCP 
functions that you need. In such cases it is useful to designate a spe¬ 
cial shift key on the device which lets you switch between different 
functions of the available controls. 

midizap provides a special SHIFT token for this purpose which 
can be used anywhere on the right-hand side of a translation. This 
token doesn’t produce any output, it merely toggles an internal bit 
indicating the current shift status. This is often used in a key transla¬ 
tion as follows: 

D8 SHIFT 

Now, midizap will go into shift mode whenever the device gen¬ 
erates the note D8 (which happens to be the shift key on an AKAI 
APCmini device, cf. Fig. 1(8); but any available button-like control 
will do). Pressing the D8 key again disables shift mode. Thus the 
above rule implements a “CapsLock”-style shift button. You can 
also do an ordinary shift button as follows: 

D8 SHIFT RELEASE SHIFT 

Here, the RELEASE token indicates an explicit release sequence 
which will be invoked as soon as the D8 key is released (i.e., the 
corresponding note-off is received). Hence pressing this key now 
toggles on the shift status, and releasing it immediately toggles it off 
again, just like an ordinary shift key on a computer keyboard. 

Having defined the shift key, we can now use its current status in 
other translations. The ^ character, when used as a prefix on the
left-hand side of a translation, tells midizap that the translation should
only be valid in shifted state. Thus we can now have two different 
rules associated with each incoming MIDI message, depending on 
the current shift status, effectively giving us about twice as many 
controls as we had before. 
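
The interplay between the SHIFT token and shifted rules can be modelled with a small Python sketch. The `ShiftState` class, its rule table keyed by (shifted, message), and the output labels "volume" and "pan" are all hypothetical; the sketch covers a single shift state and simply returns nothing when no rule exists for the current state, which is an assumption rather than documented behaviour.

```python
class ShiftState:
    """Toy model of one shift state and rules selected by it."""

    def __init__(self, rules):
        # rules maps (shifted, message) -> output (hypothetical layout)
        self.rules = rules
        self.shifted = False

    def shift_token(self):
        """One SHIFT token: flip the shift status, produce no output."""
        self.shifted = not self.shifted

    def translate(self, message):
        """Pick the rule that is valid in the current shift state."""
        return self.rules.get((self.shifted, message))


rules = {(False, "CC48"): "volume", (True, "CC48"): "pan"}
state = ShiftState(rules)

# "D8 SHIFT RELEASE SHIFT": toggle on at note-on, toggle off at note-off.
unshifted = state.translate("CC48")
state.shift_token()            # D8 pressed
shifted = state.translate("CC48")
state.shift_token()            # D8 released
after_release = state.translate("CC48")
```

With the CapsLock-style rule `D8 SHIFT`, the second `shift_token()` call would only happen on the next press of D8 instead of on its release.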

Let’s take the AKAI APCmini as an example again. We can map 
the first eight faders CC48 to CC55 on this device, cf. Fig. 1(4), to the 
MCP encoders CC16 to CC23 in shifted mode as follows: 6

^CC48= CC16~
^CC49= CC17~
...
^CC55= CC23~

6 Note that the MCP encoders use relative values in sign-bit encoding; the

Figure 1: AKAI APCmini [9, p. 5].

The above translations will only be executed in shifted mode 
(i.e., by holding the designated shift key while operating the faders). 
In unshifted mode, the faders are still available to be mapped, e.g., 
to the MCP volume controls (PB-1 to PB-8). For instance: 7 

CC48= PB[128]-1
CC49= PB[128]-2
...
CC55= PB[128]-8

You will find very similar rules in the APCmini.midizaprc ex¬ 
ample distributed with midizap. We’ve only sketched out the use 
of a single shift key here, but midizap actually supports up to four 
different shift states, which are denoted SHIFT1 to SHIFT4, with the
corresponding prefixes being 1^ to 4^. The SHIFT token and ^ prefix
we’ve seen above are in fact just shortcuts for SHIFT1 and 1^,
respectively. Thus midizap lets you have up to five different “layers”
of MIDI assignments (1 unshifted and 4 shifted states), which will 
hopefully be enough for most purposes. 

4.2. Feedback 

Some MIDI controllers have motor faders, LEDs, etc., requiring
feedback from the application. To accommodate these, you can
use the -o2 option of midizap (or the JACK_PORTS 2 directive in
the midizaprc file, cf. Section 5), to create a second pair of MIDI
input and output ports. Use of this option also activates a second
MIDI default section in the midizaprc file, labeled [MIDI2], which
is used exclusively for translating MIDI input from the second
input port and sending the resulting MIDI output to the second output
port. The control output from the application is then connected to
midizap’s second input port, and midizap’s second output port to
the input of the controller, so that the feedback from the application
passes through midizap on its way back to the controller.

~ suffix on the output CC messages indicates that these messages should be
converted to that special encoding.

7 The [128] suffix on the PB output messages denotes a scale factor here,
which scales up the 7 bit CC range to the 14 bit range of a pitch bend.

If all this has been set up properly, MIDI feedback will eliminate 
most problems with controls being out of sync with the application.
midizap has some built-in logic to help with this. Specifically, the
current state of controls received from the host application via the 
second input port will be recorded, so that subsequent MIDI output 
for data translations on the first output port will use the proper values 
for determining the required relative changes. We refer to this as 
automatic feedback. Some devices may provide you with sign-bit 
encoders which don’t need any kind of feedback for themselves. In 
this case the automatic feedback will be all that’s needed to keep 
controller and application in sync, and you don’t even have to write 
any translation rules for the feedback; just enabling the second input 
port and hooking it up to the application will be enough. 
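
The idea behind automatic feedback can be sketched as follows. This is only a guess at the general principle, not midizap's actual logic: the class `RelativeTranslator`, its method names, and the choice to report no change for a control whose state is still unknown are all assumptions made for illustration.

```python
class RelativeTranslator:
    """Turn absolute controller values into relative changes."""

    def __init__(self):
        # control name -> last value known for the application
        self.known = {}

    def feedback(self, control, value):
        """Record a value reported by the application (second input port)."""
        self.known[control] = value

    def controller_input(self, control, value):
        """Return the relative change to send for an absolute input value."""
        delta = value - self.known.get(control, value)
        self.known[control] = value
        return delta
```

If the application reports a new value through `feedback`, the next `controller_input` call computes its change relative to that reported value rather than to the controller's previous position, which is what keeps both sides in sync.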

Other controls such as motor faders will require explicit transla¬ 
tion rules for the feedback in the [MIDI2] section, however. In the 
simplest case these may just be the inverse of the rules in the [MIDI] 
section. For instance, if the APCmini had motor faders (it doesn’t), 
we might use rules like the following to translate MCP feedback 
about the fader positions back to the device: 

PB[128]-1= CC48
PB[128]-2= CC49
...
PB[128]-8= CC55

Translations can also generate their own feedback. To this end,
any MIDI message on the right-hand side of a translation can
be prefixed with the ! character (or the ^ character, which works in
an analogous fashion, but has some special logic for dealing with
shift keys built into it). This outputs the message as usual, but flips
the output ports, so that the message will go to port 2 in a forward
translation destined for port 1, and vice versa to port 1 in a feedback
translation (in the [MIDI2] section) destined for port 2. We call this
direct feedback. For instance, we can equip the D8 shift key from the
previous subsection with direct feedback as follows:

D8 SHIFT ^D8 RELEASE SHIFT ^D8

This might then light up the LED of the corresponding button 
when pressing and turn it off again when releasing the key. 

Please note that any kind of controller feedback which goes be¬ 
yond direct feedback requires that the target application already pro¬ 
vides some level of MIDI feedback on its own. midizap is not capa¬ 
ble of reading the internal state of a non-MIDI application by some 
other magical means. 

4.3. Mod Translations 

Most of the time, MIDI feedback uses just the standard kinds of 
MIDI messages readily supported by midizap, such as note messages 
which make buttons light up in different colors, or control change 
messages which set the positions of motor faders. However, there 
are some encodings of feedback messages which combine different 
bits of information in a single message, making them difficult or 
even impossible to translate using the simple kinds of rules we’ve 
seen so far. midizap offers a special variation of data translations to
help with decoding such messages. We call them mod translations
(a.k.a. “modulus” translations), because they involve operations with
integer moduli which enable you to both calculate output from input
values in a direct fashion, and modify the output messages themselves
along the way.

One important task, which we’ll use as an example below, is the 
decoding of meter (RMS level) data in the Mackie protocol. There, 
each meter value is represented as a channel pressure (CP) message 
whose value consists of a mixer channel index 0..7 in the “high nib¬ 
ble” (bits 4..6) and the corresponding meter value in the “low nibble” 
(bits 0..3). We will show how to map these values to notes indicat¬ 
ing buttons on the AKAI APCmini (Fig. 1). Mod translations aren't 
limited to this specific use case, however; similar rules will apply to 
other kinds of “scrambled” MIDI data. 

In its simplest form, a mod translation looks as follows (taking 
channel pressure as an example): 

CP[16] C0 

In contrast to the simple kinds of data translations we’ve seen so 
far, there's no increment (+ or -) flag here, so the translation does 
not indicate an incremental change of the input value. Instead, mod 
translations always work with absolute values, and the step size on 
the left-hand side is treated as a modulus to decompose the input 
value into two separate quantities, quotient and remainder. Only the 
latter becomes the value of the output message, while the former is 
used as an offset to modify the output message. 

In order to describe more precisely how this works, let’s assume 
an input value v and a modulus k. We divide v by k, yielding the 
quotient (offset) q = v div k and the remainder (value) r = v mod 
k. E.g., with k = 16 and v = 21, you’ll get q = 1 and r = 5 (21 
divided by 16 yields 1 with a remainder of 5). The calculated offset 
q is then applied to the note itself, and the remainder r becomes the 
velocity of that note. So in the example above the output would be 
the note C#0 (C0 offset by 1) with a velocity of 5. On the APCmini,
this message will light up the second button in the bottom row of the 
8x8 grid in yellow. 
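
The arithmetic of this example can be checked with a few lines of Python. The helper names `mod_translate` and `note_name` are hypothetical; note numbers assume the default numbering in which C5 is middle C (MIDI note 60), so that C0 is note 0.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def mod_translate(value, modulus, base_note):
    """Split value into offset q and remainder r; return (note, velocity)."""
    q, r = divmod(value, modulus)  # q = v div k, r = v mod k
    return base_note + q, r

def note_name(number):
    """Name of a MIDI note number, with C5 = 60 (so C0 = 0)."""
    return f"{NOTE_NAMES[number % 12]}{number // 12}"

# The rule "CP[16] C0" applied to the input value v = 21:
note, velocity = mod_translate(21, 16, 0)
```

This gives `note_name(note) == "C#0"` and `velocity == 5`, matching the worked example in the text.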

Mod translations are midizap’s Swiss army knife for dealing with
complicated translations. There are also some special elements in the 
MIDI syntax which can be used in mod translations to make them 
even more flexible: 

• The empty modulus bracket, denoted [ ] on the left-hand side 
of a mod translation, indicates a default modulus large enough 
(16384 for PB, 128 for other messages) so that the offset q 
always becomes zero and the translation passes on the entire 
input value as is. 

• The transposition flag, denoted with the ' (apostrophe) suffix 
on an output message, reverses the roles of q and r, so that 
the remainder becomes the offset and the quotient the value 
of the output message. 

• The change flag, denoted with the ? suffix on an output mes¬ 
sage, only outputs the message if there are any changes in 
offset or value. 

• Value lists, denoted as lists of numbers separated by commas 
and enclosed in curly braces, provide a way to describe dis¬ 
crete mappings of input to output values. The input value is 
used as an index into the list to give the corresponding output 
value, and the last value in the list will be used for any index 
which runs past the end of the list. There are also some conve¬ 
nient shortcuts which let you construct these lists more easily: 
repetition a : b (denoting b consecutive a’s) and enumeration 
a - b (denoting a, a ± 1,..., b, which ramps either up or down
depending on whether a < b or a > b, respectively). 
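
To make the value list shortcuts concrete, the following Python sketch expands the body of a list (the text between the curly braces) and performs the index lookup described above. It is a simplified model under stated assumptions: the helper names are hypothetical, only non-negative integers are handled, and midizap's real parser may differ in details.

```python
def expand_value_list(spec):
    """Expand a list body such as '0:3,1' or '1-4,0' into a list of ints."""
    values = []
    for item in spec.split(","):
        item = item.strip()
        if ":" in item:                      # repetition a:b -> b copies of a
            a, b = (int(x) for x in item.split(":"))
            values.extend([a] * b)
        elif "-" in item:                    # enumeration a-b, up or down
            a, b = (int(x) for x in item.split("-"))
            step = 1 if a <= b else -1
            values.extend(range(a, b + step, step))
        else:                                # plain number
            values.append(int(item))
    return values

def lookup(values, index):
    """Use the input value as an index; past the end, reuse the last entry."""
    return values[min(index, len(values) - 1)]

print(expand_value_list("0:3,1"))            # three zeros followed by a 1
print(lookup(expand_value_list("0:3,1"), 10))
```

Clamping the index to the last element models the rule that the final value is used for any index which runs past the end of the list.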

We can’t go into all of this here, so we have to refer the reader 
once again to the manual for details. But here’s how we can use a sin¬ 
gle mod translation to map MCP meter feedback onto the APCmini’s 
topmost five button rows, turning them into a colorful meter display: 

CP[16] C2{0,1} G#2{0:3,1} E3{0:6,1} C4{0:9,5} G#4{0:12,3} 

To understand how this works, one must know that the buttons of 
the 8x8 grid, cf. Fig. 1(6), can be lit up by sending them the appropri¬ 
ate note messages. Rows number 4 to 8 (counting from the bottom) 
start at notes C2, G#2, E3, C4 and G#4, respectively. The velocities of 
the notes indicate the colors (0 means off, 1 green, 5 yellow, and 3 
red). The rule above will thus light up buttons in different rows in 
different colors (depending on the low nibble of the channel pressure 
value), and in different columns (depending on the high nibble of the 
channel pressure value which ranges from 0 to 7 and indicates the 
mixer channel). 
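
As a rough check of how this rule behaves, the Python sketch below simulates it for a single channel pressure value. It is a model of the rule as described in the text, not midizap code; the row start notes are given as note numbers assuming zero-based octaves (C2 = 24, G#2 = 32, E3 = 40, C4 = 48, G#4 = 56), and the function name is hypothetical.

```python
# Each row: (start note, expanded value list from the rule above).
ROWS = [
    (24, [0, 1]),             # C2{0,1}
    (32, [0] * 3 + [1]),      # G#2{0:3,1}
    (40, [0] * 6 + [1]),      # E3{0:6,1}
    (48, [0] * 9 + [5]),      # C4{0:9,5}
    (56, [0] * 12 + [3]),     # G#4{0:12,3}
]

def meter_feedback(pressure):
    """Return the (note, velocity) pairs produced for one CP value."""
    channel, level = divmod(pressure, 16)    # high nibble, low nibble
    return [
        (start + channel, colors[min(level, len(colors) - 1)])
        for start, colors in ROWS
    ]

# Mixer channel 2 (third column) with meter level 10:
print(meter_feedback(2 * 16 + 10))
```

For channel 2 and level 10, the three lower rows turn green (velocity 1), the fourth row turns yellow (velocity 5), and the top row stays off, since red only appears from level 12 upwards.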

Mod translations are surprisingly versatile and can be used for
various purposes. In particular, they can also be called as
macros from other translations. This adds a (rather rudimentary) pro¬ 
gramming facility to the configuration language, which isn’t needed 
very often, but gives you some extra rope to tackle complicated trans¬ 
lations. We won’t go into this here, so please check the manual for 
details and many more examples. 

4.4. Pass-Through 

There are some situations in which it may be possible to keep most of 
the controller input and pass it through unchanged. In particular, this 
case arises in Mackie translations for devices which already support 
MCP, but might need some minor touches here and there to make 
them work exactly the way you want. 

For instance, Behringer’s X-Touch Mini (Fig. 2) is a fairly nice 
device with its eight encoders providing LED feedback, but its MCP 
mode is somewhat lacking. One thing that many users of the device 
complain about is that it doesn’t have any keys for changing mixer 
banks. But in fact the device has two “layer” keys on the right which 
seem ideal for that purpose; alas, the Behringer engineers decided 
to have them assigned to some other less important MCP functions 
instead. With midizap it’s very easy to fix this shortcoming, by just 
reassigning the two keys to the much wanted bank change keys: 

C7 A#3 # BANK LEFT 

C#7 B3 # BANK RIGHT 

We still need to make sure that everything else is passed through 
unchanged. The most convenient way to do this is to just add the 
PASSTHROUGH directive to the configuration. You can place this any¬ 
where, but it’s most convenient to have this kind of stuff at the be¬ 
ginning of the configuration file, before the first translation section. 
The directive tells midizap to pass a message from the input to the 
output port if it doesn't have an explicit translation for that message. 
So the final configuration will look like this: 

PASSTHROUGH 

[MIDI]

C7 A#3 # BANK LEFT

C#7 B3 # BANK RIGHT

[MIDI2]

# feedback for the BANK LEFT/RIGHT buttons
A#3 C7
B3 C#7

Figure 2: X-Touch Mini [10, p. 16].

Here we also added two more translations in the [MIDI2] section
so that the feedback for the two remapped buttons works as expected. 
To finish off that little example, you may want to add a few more di¬ 
rectives, so that midizap automatically creates the feedback port and 
auto-connects to the right device and applications; we will discuss 
these in the next section. You can also find an enhanced version 
of this example in the sources (XTouchMini.midizaprc), which adds 
many other useful MCP functions. 
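
The effect of the PASSTHROUGH directive can be pictured as a table lookup with a fallback. The following Python sketch is a deliberately simplified model (messages are plain strings, and the function name is an assumption made for illustration); it does not reflect how midizap represents MIDI messages internally.

```python
def route(message, translations, passthrough=True):
    """Return the message to emit, or None if the input is dropped."""
    if message in translations:
        return translations[message]         # explicit translation wins
    return message if passthrough else None  # otherwise pass through or drop

# The two remapped "layer" keys from the example configuration.
midi_section = {"C7": "A#3", "C#7": "B3"}

print(route("C7", midi_section))    # remapped to the BANK LEFT key
print(route("D7", midi_section))    # no translation, passed on unchanged
```

Without pass-through, an untranslated message such as D7 would simply not reach the output port in this model.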

Please note that the PASSTHROUGH directive only applies to nor¬ 
mal (non-system) messages. In some cases it will be necessary to 
also pass on system messages, such as system exclusive, which can 
be done with the SYSTEM_PASSTHROUGH directive. System exclusive
messages are used in MCP to set the contents of the scribble strips. 
The X-Touch Mini doesn't have these, but other devices like the X- 
Touch One do, and will thus need system pass-through to function 
properly (see the XTouchONE.midizaprc example in the sources). 

5. JACK INTERFACE 

There are some additional directives and corresponding command 
line options to configure midizap’s Jack setup in various ways. If 
both the command line options and directives in the midizaprc file 
are used, the former take priority, so that it’s possible to override the 
configuration settings from the command line. Note that all these 
options can only be set at program startup. If you later edit the corre¬ 
sponding directives in the configuration file, the changes won’t take 
effect until you restart the program. 

5.1. Client Setup 

The -j option and the JACK_NAME directive change the Jack client 
name from the default (midizap) to whatever you want it to be. 
To use this option, simply invoke midizap with -j followed by the
desired client name, or put a directive like the following into your 
midizaprc file: 

JACK_NAME "midizap-XTouchMini" 

This option is useful, in particular, if you’re running multiple 
instances of midizap with different configurations for different con¬ 
trollers and/or target applications, and you want to have the corre¬ 
sponding Jack clients named differently, so that they can be identi¬ 
fied more easily. 

We've already seen the -o option which is used to equip the Jack
client with an additional output port. This can also be achieved with
the JACK_PORTS directive in the midizaprc file, as follows:

JACK_PORTS 1 

The given number of output ports must be 0, 1 or 2. Zero means 
that MIDI output is disabled (which is the default). You may want 
to use JACK_PORTS 1 if the configuration is primarily aimed at doing
MIDI translations, so you’d like to have MIDI output enabled by
default. JACK_PORTS 2 or the -o2 option indicates that two pairs of
input and output ports are to be created. As already discussed in 
Section 4, the second port is typically used to deal with controller 
feedback from the application. 

Not very surprisingly, at least one output port is needed if you 
want to output any MIDI at all; otherwise MIDI messages on the 
right-hand side of translations will be silently ignored. 

5.2. MIDI Connections 

Setting up all the required connections for the Jack MIDI ports can 
be a tedious and error-prone task, especially if you have to deal 
with complex setups involving feedback and/or multiple midizap instances.
It’s possible to automate the MIDI connections, e.g., with
QjackCtl’s persistent MIDI patchbay facility, but this is often incon¬ 
venient if you need to accommodate multiple midizap configurations 
and you already have a complicated studio setup (or indeed a bunch 
of them) which you don't want to mess with. 

Therefore midizap offers its own built-in patchbay functionality 
using the JACK_IN and JACK_OUT directives which let you specify the
required connections in the configuration itself. The port number is
tacked on to the directive, so, e.g., JACK_IN2 connects the second
input port. If the port number is omitted then it defaults to 1, so
both JACK_OUT1 and just JACK_OUT connect the first output port. The
directive is followed by a regular expression to be matched against 
the Jack MIDI ports of your devices and applications. For instance, 
the following lines connect midizap to an X-Touch Mini device on 
one side and Ardour's Mackie control port on the other. (This kind 
of setup is rather typical for configurations involving feedback. For 
simple setups just specifying the JACK_IN and JACK_OUT directives
is often sufficient, or even just JACK_IN if the target application isn't
MIDI-capable.) 

JACK_IN1 X-TOUCH MINI MIDI 1
JACK_OUT1 ardour:mackie control in
JACK_IN2 ardour:mackie control out
JACK_OUT2 X-TOUCH MINI MIDI 1

A connection will be established automatically by midizap when¬ 
ever a MIDI port belonging to another Jack client matches the regu¬ 
lar expression, as well as the port type and I/O direction. This also 
works dynamically, as new devices get added and new applications 
are launched at runtime. Only one directive can be specified for each 
port, but since midizap will connect to all ports matching the given 
regular expression, you can connect to more than one application 
or device by just listing all the alternatives. For instance, to have 
midizap’s output connected to both Ardour and Pd, you might use a 
directive like: 

JACK_OUT1 ardour:MIDI control in|Pure Data Midi-In 1

All matches are done against full port names including the client-name:
prefix, so you can specify exactly which ports of which clients
should be connected. However, note that in contrast to the QjackCtl
patchbay, midizap does substring matches by default, so that, e.g.,
just “MIDI control” would match any Ardour MIDI control port,
in any instance of the program (and also ports with the same name
in other programs). If you want to specify an exact match, you need
to use the ^ and $ anchors as follows:

JACK_OUT1 ^ardour:MIDI control in$
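
The difference between substring, alternative and anchored matches can be tried out with Python's re module. This is only an approximation: the port names below are hypothetical, and midizap's regular expression dialect may differ from Python's in details, although the simple patterns used here should behave the same way.

```python
import re

# Hypothetical Jack MIDI port names, including the "client-name:" prefix.
PORTS = [
    "ardour:MIDI control in",
    "ardour-2:MIDI control in",
    "ardour:mackie control in",
    "Pure Data:Pure Data Midi-In 1",
]

def matching_ports(pattern, ports=PORTS):
    """Return all ports containing a match (substring semantics)."""
    return [port for port in ports if re.search(pattern, port)]

print(matching_ports("MIDI control"))                        # any Ardour instance
print(matching_ports("ardour:MIDI control in|Pure Data Midi-In 1"))
print(matching_ports("^ardour:MIDI control in$"))            # exact match only
```

The unanchored pattern picks up the control port of every Ardour instance, while the anchored one selects exactly one port.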

5.3. Jack Sessions 

midizap also supports Jack session management which provides a 
convenient alternative way to launch your midizap instances. Once 
you’ve finished a configuration, instead of running midizap manually 
each time you need it, you just invoke it once with the right command 
line options, and use a Jack session management program to record 
the session. The session manager can then be used to relaunch the 
program with the same options later. 

Various Jack session managers are available for Linux, but if 
you’re running QjackCtl already, you might just as well use it to 
record your sessions, too. QjackCtl’s session manager is available 
in its Session dialog. To use it, launch midizap and any other Jack 
applications you want to have in the session, and then hit the “Save” 
button in the Session dialog to have the session recorded. Now, at any 
later time you can rerun the recorded session with the “Load” button 
in the same dialog, and your most recent sessions are available in the
“Recent” menu from where they can be launched quickly.

6. CONCLUSIONS 

I hope that you'll enjoy using midizap for your MIDI mapping needs 
as much as I do. I'd like to emphasize, however, that midizap is noth¬ 
ing more (and nothing less) than a simple and practical solution to a 
nagging problem that I have run into time and again (as presumably 
many Linux MIDI users do). midizap has its limitations, and it is
definitely not intended as a replacement for more ambitious projects.
Ctlra [1] along with its Mappa component takes a much higher-level
approach based on the idea of abstracting device interfaces so that 
basically any Ctlra client can be used with any Ctlra-supported de¬ 
vice. This promises to scale much more easily, but it will take its 
time to gather a critical mass of supported devices and applications. 

In the meantime we now have midizap which is a much more 
modest design, but can make any MIDI controller work with pretty 
much any application out there, as long as the application can be 
controlled with keyboard and/or MIDI commands. And you don't 
need to be a computer expert to use it; if you know how to use Jack, 
a text editor, and the command line, you're good to go. 

Contributions are welcome; in particular, we're looking for in¬ 
teresting configurations to be included in the distribution. I consider 
midizap itself finished at this point (ports, bugfixes and feature creep 
notwithstanding), but one area which could still be simplified is the 
configuration process. While experienced Linux users may actually 
prefer the textual interface that midizap provides (especially when 
using midizap’s Emacs mode), editing configuration files and watch¬ 
ing debugging output in a terminal can be a bit daunting. So a GUI- 
based configuration front-end (maybe something along the lines of 
existing MIDI learn facilities) might be in order here. 

As Ctlra matures, another interesting possibility is to have a di¬ 
rect interface between Ctlra and midizap at some point. It’s already 
possible to run midizap and Ctlra’s daemon program in concert, but 
tighter integration could be achieved, e.g., by adding a Ctlra back¬ 
end to midizap. 

config      ::= { directive | header | translation }
header      ::= "[" name "]" [ "CLASS" | "TITLE" ] regex
translation ::= midi-token { key-token | midi-token }

directive   ::= "DEBUG_REGEX" | "DEBUG_STROKES" | "DEBUG_KEYS" |
                "DEBUG_MIDI" | "MIDI_OCTAVE" number |
                "JACK_NAME" string | "JACK_PORTS" number |
                "JACK_IN" [ number ] regex |
                "JACK_OUT" [ number ] regex |
                "PASSTHROUGH" [ number ] |
                "SYSTEM_PASSTHROUGH" [ number ]

midi-token  ::= msg [ mod ] [ steps ] [ "-" number ] [ flag ]
msg         ::= ( note | other | "M" ) [ number ]
note        ::= ( "A" | ... | "G" ) [ "#" | "b" ]
other       ::= "CH" | "PB" | "PC" | "CC" | "CP" | "KP:" note
mod         ::= "[" [ number ] "]"
steps       ::= "[" number "]" | "{" list "}"
list        ::= number { "," number | ":" number | "-" number }
flag        ::= "-" | "+" | "=" | "<" | ">" | "~" |
                "'" | "?" | "'?" | "?'"

key-token   ::= "RELEASE" | "SHIFT" [ number ] |
                keycode [ "/" keyflag ] | string
keycode     ::= "XK_Button_1" | "XK_Button_2" | "XK_Button_3" |
                "XK_Scroll_Up" | "XK_Scroll_Down" |
                "XK_..." (see /usr/include/X11/keysymdef.h)
keyflag     ::= "U" | "D" | "H"
string      ::= '"' { character } '"'

ABSTRACT

This paper presents a library for Supercollider that enables live cod¬ 
ing adapted to two domains of performance: telematic dance with 
wireless sensors and electroacoustic music performance. The library 
solves some fundamental issues of usability in Supercollider which 
have been also addressed by the established live-coding framework 
JITLib, such as modifying synth and pattern processes while they 
are working, linking control and audio i/o between synths, and gen¬ 
eration of GUIs. It offers new implementations, which are more 
compact and easy to use while emphasizing transparency and scalability
of code. It introduces binary operators which when coupled
to polymorphism facilitate live coding. Several foundation classes 
are introduced whose purpose is to support programming patterns or 
commonly used practices such as the observer pattern, function call¬ 
backs and system-wide object messaging between language, server 
processes and GUI. 

The use of the library is demonstrated in two contexts: a telem¬ 
atic dance project with custom low-cost movement sensors, and dig¬ 
ital implementations of early electroacoustic music scores by J. Har¬ 
vey and K. Stockhausen. The latter involves coding of a complex 
score and generation of a GUI representation with time tracking and 
live control. 

1. BACKGROUND 

1.1. Bridging Live Coding and Gestural Interaction 

The performance practice known as live coding emerged from the 
ability of software to modify state and behavior through the inter¬ 
active evaluation of code fragments and to synthesize audio at run¬ 
time. As a result, several programming environments and technolo¬ 
gies supporting live coding have been developed in the past 20 years, 
such as SuperCollider [1], Impromptu [2], ChucK [3], Extempore [4],
Gibber [5], and others. It has been noted, however, that such environments
and practices suffer from a lack of immediacy and those
visible gestural elements that are traditionally associated with live
performance [6]. Recent research projects attempt to re-introduce
gestural aspects or to otherwise support social and interactive ele¬ 
ments in musical performance using technologies associated with 
live coding ([7], [8], [9], [10]). Amongst various types of gestural 
interaction, dance is arguably the one least related to textual coding. 
Few recent studies exist which prepare the field for bridging dance 
with coding ([11]). The challenges in this domain can be summa¬ 
rized as the problem of bridging the symbolic domains of dance and 
music notation and the subsymbolic numerical domain of control 
data streams input from sensors. This also implies translating between
continuous streams of data and individual timed events, possibly
tagged with symbolic values. This is a technologically highly
demanding task which is the subject of research in various gestural interface
applications. The work reported in the present paper represents
an indirect and bottom-up approach to the topic, based on DIY
and open source components and emphasizing transparency and self- 
sufficiency at each step. It does not address the task of gesture recog¬ 
nition, but rather it aims at supporting live coding in conjunction with 
dancers and instrumental performers. Ongoing experiments together
with such performers are helping to identify low-level tasks and features
which are essential for practical work. This type of work is
purely empirical, and tries to identify usability criteria purely from
practice, rather than to develop features that are inferred from known 
interaction paradigms in other related domains. At this stage of the 
project it is still too early to formulate conclusions from these ex¬ 
periments. Instead, this paper concentrates on the fundamentals of 
the implementation framework on which this work is based. These
are readily identifiable and their potential impact on further develop¬ 
ment work as well as experiments are visible. This paper therefore 
describes the basic principles and design strategy of the sc-hacks li¬ 
brary, and discusses its perceived impact on performances. Finally, 
it outlines some future perspectives for work involving data analysis 
and machine learning. 

1.2. Live Coding Frameworks in Supercollider 

1.2.1. Types of Live Coding Frameworks 

Live Coding libraries can be divided into two main categories de¬ 
pending on the level of generality of their implementation and their 
application scope. First, there are libraries which extend SuperCollider
usage in order to simplify the coding of behaviors
or features which are very common in performance, but are otherwise
inconvenient to code in SuperCollider. To this category belongs
the JITLib framework. JITLib (Just-In-Time programming Library)
has been around since at least August 2006, with an early
version since ca. 2000 (see footnote 1), and is very widely used in the community,
being the de-facto go-to tool for live coding in SuperCollider.
The second category consists of libraries that concentrate on spe¬ 
cialized usage scenarios and attempt to create domain-specific mini¬ 
languages for those scenarios on top of SuperCollider. Such are: 
IXI-Lang (a sequencer / sample playing mini-language by Thor Magnusson
[12]), SuperSampler (a polyphonic concatenative sampler
with automatic arrangement of sounds on a 2-dimensional plane,
by Shu-Cheng Allen Wu [13]), and Colliding (an "environment for
synthesis-oriented live coding", simplifying the coding of Unit Generator
graphs, by Gerard Roma [14]). Finally, TidalCycles by Alex
McLean [15] should be mentioned, which develops its own live coding
language based on Haskell and focussing on the coding of complex
layers of synchronized beat cycles with sample playback and
synthesis, and uses the SuperCollider synthesis server as audio engine.

1 See https://swiki.hfbk-hamburg.de/MusicTechnology/566 (accessed 20-December-2018)

1.2.2. sc-hacks Objectives and Approach 

sc-hacks belongs to the first category of frameworks, and its initial 
motivation was partly to implement some of the solutions of JITLib 
in more robust, simple, and general ways. In parallel, inspiration 
from ChucK’s => operator led to the development of a minimal extension
of the language based on 4 binary operators (+>, <+, *>,
*<), which, coupled with polymorphism, permit simplified and compact
coding of several common sound-structure coding patterns. Furthermore,
the implementation of some basic programming patterns (see footnote 2)
opened new possibilities for the creation of GUI elements which up¬ 
date their state. This led to a proliferation of GUI building and man¬ 
agement facilities and resulted in several interfaces for live coding 
tasks, such as a code browser based on the concept of code snippets, 
a browser for editing and controlling the behavior of named players 
holding synth or pattern items, and shortcuts for building GUI wid¬ 
gets displaying values of parameters controlled by OSC, MIDI or al¬ 
gorithmic processes. Finally, ongoing experiments with dancers and 
instrumentalists are giving rise to new interface and notation ideas. 
The current focus is on building tools for recording, visualising and 
playback of data received from wireless sensors via OSC, in order 
to experiment with the data in performance, and to apply machine- 
learning algorithms on them. 

2. APPROACH 

2.1. Players and Player Environments 

JITLib addresses four fundamental problems in coding for concur¬ 
rent sound processes: (a) Use of named placeholders for sound gen¬ 
erating processes, (b) managing the control parameters of processes 
in separate namespaces, (c) modifying event-generating algorithmic 
processes (known in Supercollider as Patterns ) on the fly and (d) 
interconnecting audio signals between inputs and outputs of synth 
processes. Sc-hacks offers alternative solutions to these problems which
present advantages, described in the following sections: 

2.1.1. Named placeholders: -def classes vs. Player class 

To use a name as placeholder for a synth process in order to start, 
stop or modify the process on the fly, JITLib introduces the [X-]def 
convention, i.e. it defines a number of classes which act as named 
containers for different types of processes (Synths: Ndef, Tasks: 
Tdef, Patterns: Pdef, etc.). Sc-hacks uses a single Player object
class instead. A Player instance can play a Synth or a Pattern depending
on the type of source which it is asked to play, i.e. synth
definition, synth function, or event-stream generating instance (see
for example the code in Figure 3). This provides greater flexibility and simplicity
in the coding of synth processes compared to JITLib.

2.1.2. Separate parameter namespaces: ProxySpace vs. Nevent 

A significant innovation introduced by JITLib consisted in the con¬ 
cept of a ProxySpace, that is, a namespace that can function as the 
current environment. ProxySpace is based on EnvironmentRedirect, 

2 See for example the Observer pattern: https://en.wikipedia.org/wiki/Observer_pattern
(accessed 20-December-2018)


a Class which holds a Dictionary and ensures that a predefined cus¬ 
tom function is executed each time that a value is stored in one of 
the keys of the Dictionary. Sc-hacks defines a subclass of Environ¬ 
mentRedirect similar to ProxySpace, but defines a custom function 
that provides extra flexibility in setting values which is useful during 
performance in accessing control parameters. This enables keeping 
track of which parameter refers to which process, storing parameter 
values between subsequent starts of a process belonging to a player, 
and updating GUI elements to display values as these change. Ad¬ 
ditionally, sc-hacks makes the environment of the player current af¬ 
ter certain operations, in order to make the current context the one 
normally expected by the performer. This, however, is not always a
safe solution. For this reason, the target environment can be provided
as an adverb argument to binary operators involving players,
which ensures that code will work as expected even when the
order of execution of code changes in an irregular manner.
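
The idea of an environment that runs a custom function whenever a value is stored can be illustrated outside SuperCollider as well. The Python sketch below is an analogy only: the class name and callback signature are invented for the example and do not correspond to the actual EnvironmentRedirect or sc-hacks API.

```python
class RedirectingEnvironment(dict):
    """A dictionary that calls a user-supplied function on every store."""

    def __init__(self, on_put):
        super().__init__()
        self.on_put = on_put

    def __setitem__(self, key, value):
        super().__setitem__(key, value)   # keep the value for later restarts
        self.on_put(key, value)           # e.g. update a synth or a GUI widget

updates = []
env = RedirectingEnvironment(lambda key, value: updates.append((key, value)))
env["freq"] = 440
env["freq"] = 660

print(env["freq"], updates)
```

Because every store passes through one hook, the same mechanism can both remember the last parameter value and notify interested parties such as GUI elements.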

2.1.3. Modifying event generating processes on the fly 

Event-generating algorithmic processes are implemented in SuperCollider
through class Pbind. Pbind takes an array of keys and associated
streams as argument and creates a Routine that calculates parameters
and event types for each set of keys and values obtained
from their associated streams, and schedules them according to the 
duration obtained from the stream stored under the key dur. The im¬ 
plementation of Pbind allows no access to the values of each event, 
i.e. it is not possible to read or to modify the value of a key at any 
moment. Furthermore, it is not possible to modify the structure of 
the dictionary of keys and streams while its event-generating pro¬ 
cess is playing. This means that Pbind processes cannot be modified 
interactively while they are playing. In order to circumvent this limitation,
a number of techniques have been devised which require
adding code for any key that one wishes to read or to modify. JITLib
uses such techniques and also provides a way to substitute a Pbind 
process while it is running with a new one, thereby indirectly al¬ 
lowing modification of that process. Sc-hacks provides a new ap¬ 
proach for playing event-generating processes, which uses the same 
Event-playing mechanism as Pbind, but grants both read and write 
access to the data which generate the event stream, and thus permits 
modification of the generating key-stream collection on the fly. This 
radically simplifies the task of modifying event generating processes 
while they are playing. For example, adding or substituting key- 
value stream pairs to a process while it is playing can be achieved 
simply by sending the corresponding key-stream pairs as events to 
the same player, as shown in Figure 1.

(dur: 0.1) +> \mystream; 

// Substitute duration stream: 

(dur: [0.1, 0.2].prand) +> \mystream; 

// Add degree stream: 

(degree: (-10..10).prand) +> \mystream; 

Figure 1: Adding and substituting key streams to event generators. 
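
The principle of keeping the key-stream collection open for modification while events are being generated can be modelled in a few lines of Python. The sketch is a conceptual analogy under stated assumptions: the EventPlayer class and its methods are hypothetical and do not mirror the sc-hacks implementation or SuperCollider's Event mechanism.

```python
import itertools

class EventPlayer:
    """Generates events from a dictionary of streams that stays editable."""

    def __init__(self):
        self.streams = {}

    def merge(self, **streams):
        """Add or replace key streams; takes effect at the next event."""
        self.streams.update(streams)

    def next_event(self):
        """Pull one value from every stream to build the next event."""
        return {key: next(stream) for key, stream in self.streams.items()}

player = EventPlayer()
player.merge(dur=itertools.repeat(0.1))
first = player.next_event()

# Add a degree stream while the "process" is already running.
player.merge(degree=itertools.cycle([0, 2, 4]))
second = player.next_event()

print(first, second)
```

Since each event is built from the current state of the dictionary, added or substituted streams are picked up without restarting the process.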

2.1.4. Interconnecting audio signals 

The task of connecting the output of one audio process with the input
of another audio process is complicated in SuperCollider by the
requirements (a) to specify the bus which will carry the signal to be
shared and (b) to ensure that the synth reading from the signal will
be placed after the synth which is writing to the signal in the execution
order of the synth engine (scsynth). The implementation of the
solution in JITLib involves several classes with several instance vari¬ 
ables and hundreds of lines of code and defies description within the 
scope of the present paper. Additionally, coding the configuration of 
one-to-many or many-to-one interconnections of audio i/o between 
synth processes can be both verbose and complex, as witnessed for
example in exchanges on the SuperCollider mailing list such as
this one: https://sc-users.bham.ac.narkive.com/PAapaSaM/many-to-one-audio-routing-in-jitlib
(accessed 20-December-2018). Sc-hacks introduces a new solution
which permits simpler coding and guarantees persistence of
established configurations even when the server is rebooted during
a work session. The implementation is based on mechanisms
for hierarchical namespaces and function callback implemented
in sc-hacks through two new classes discussed below: Registry
and Notification. The coding of one-to-many and many-to-one
connections is exemplified in Figure 2:

// many-to-one interconnection
\source1 *> \fx1;
\source2 *> \fx1;
// one-to-many interconnection
\source3 *< \fx2;
\source3 *< \fx3;

Figure 2: Interconnecting audio signals. 

Note that no additional coding is required if using the default input 
and output parameter names \in and \out and number of chan¬ 
nels (1). PersistentBusProxy is used to specify custom parameter 
names and channel numbers. The operator @ can optionally be used 
as shortcut to create PersistentBusProxy instances. 
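
The ordering requirement behind such interconnections (a reader must run after every process that writes to its input) amounts to a topological sort of the connection graph. The following Python sketch illustrates that requirement using the standard graphlib module; it is not how sc-hacks or scsynth arrange nodes, and the function name is an assumption.

```python
from graphlib import TopologicalSorter

def execution_order(connections):
    """Order players so that each writer precedes all of its readers.

    connections: iterable of (writer, reader) pairs.
    """
    sorter = TopologicalSorter()
    for writer, reader in connections:
        sorter.add(reader, writer)        # the reader depends on the writer
    return list(sorter.static_order())

order = execution_order([
    ("source1", "fx1"), ("source2", "fx1"),   # many-to-one
    ("source3", "fx2"), ("source3", "fx3"),   # one-to-many
])
print(order)
```

Any order returned this way places source1 and source2 before fx1, and source3 before fx2 and fx3; a feedback loop would raise graphlib.CycleError and would need separate treatment.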

2.2. Binary operators 

The primary coding strategy of sc-hacks for sound processes is built 
around a small number of binary operators. Each operator encapsu¬ 
lates a group of actions on sound objects such as synthesis parame¬ 
ters, player objects holding single synths or synth processes, busses, 
buffers, midi or osc control instances. The operators are: 


left operand    operator    right operand
source          +>          player
source          *>          player
parameter       <+          value
parameter       *<          value


2.2.1. +> : Play source in player 

The +> operator plays the source in the player. The source can be the name
of a synthesis definition as symbol, a synthesis function, or an event.
For example, the code in Figure 3 can be evaluated line-by-line to play, in
the player named 'example', in sequence: a synth using the SynthDef
named 'default', a Unit Generator Synth Graph containing a
Sine Oscillator, an empty event with default parameters (degree: 0,
dur: 1), an event with duration 0.1, and an event whose degree is a pattern
using a brownian stream with values between -10 and 10 and maximum
step 2. Sending different types of sources (synthdef names,
synth functions, events) to the same player will replace the previous
source with the newest one. Sending nil stops the player.


\default +> \example; // play synthdef
{ SinOsc.ar(440, 0, 0.1) } +> \example; // play synth function
() +> \example; // play event
(dur: 0.1) +> \example; // modify event
(degree: [-10, 10, 2].pbrown) +> \example; // add brownian degree stream
nil +> \example; // stop player

Figure 3: Player operator +>.

Additionally, sc-hacks permits one to browse the code executed for 
each player on a dedicated GUI (similar to operations on Shreds in 
the miniAudicle GUI of ChucK), to edit existing code and resend it 
to the player, and to start or stop a player by clicking on its name in 
the list of existing players, as shown in Figure 4. The list of evaluated 
code strings is permanently saved on file for each session. 

[Screenshot of the Players window: the player list shows "example" with its source \default, and the buttons "Quit Server" and "Stop all".]

Figure 4: Player GUI.


2.2.2. *> : Advanced operations on player argument 

The *> operator takes different meanings depending on the type of
the left operand, as follows:

type of left operand    action
Event                   Set parameter values without starting events
Function                Play function as routine in environment
Symbol                  Add receiver as audio source to argument
PersistentBusProxy      Add source with custom i/o mapping

2.2.3. <+: Set or map parameter 

The <+ operator acts on the parameter named by the receiver (left 
operand) depending on the type of the argument (right operand), as 
follows: 

type of right operand   action
Integer or Float        Set parameter value
Symbol                  Map parameter to named control bus
Envelope                Map parameter to envelope signal
Function                Map parameter to Synth Function output
MIDI                    Bind parameter to MIDI input
OSC                     Bind parameter to OSC input

The parameter named by the left operand belongs by default to the
current environment. In order to specify a different environment, one
can name the environment as an adverb to the binary operator using
standard SuperCollider syntax, e.g.: \freq <+.myenvir 660.
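
Selecting an action by the type of the operand is a form of polymorphic dispatch. As a language-neutral illustration, the Python sketch below uses functools.singledispatch to pick an action for a numeric or symbolic (string) right operand. The function name and the returned descriptions are invented for the example, and only two of the operand types listed above are covered.

```python
from functools import singledispatch

@singledispatch
def set_or_map(value, parameter):
    """Fallback for operand types without a registered action."""
    raise TypeError(f"unsupported operand for {parameter}: {value!r}")

@set_or_map.register(int)
@set_or_map.register(float)
def _(value, parameter):
    # Numbers set the parameter value directly.
    return f"set {parameter} to {value}"

@set_or_map.register(str)
def _(value, parameter):
    # A name (Symbol in SuperCollider) maps the parameter to a control bus.
    return f"map {parameter} to control bus {value}"

print(set_or_map(660, "freq"))
print(set_or_map("lfo1", "freq"))
```

In SuperCollider the same effect is obtained by implementing the operator as a method in each operand class, so that the receiver's class decides what the operator does.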

2.2.4. *<: One-to-many audio i/o interconnections

The *< operator, in analogy to *>, is used to create one-to-many i/o 
interconnections, that is, to connect the audio output from one Player 
to the inputs of several different Players. 

2.3. Fundamental Classes 

To implement the above features, sc-hacks introduces classes which
implement pattern-language-like features that enable functionality
across a wide variety of tasks such as storing and retrieving single
instances in tree data structures (Registry Class), updating the state of
concerned items in response to changes (Notification Class), and
enforcing sequential order of execution in asynchronous calls to the
server when booting, loading synthdefs and loading or initializing
audio buffers (ActionSequence Class). These classes formed the
backbone for the rapid creation of custom extensions to the library to
meet the performance requirements described in the next section.
This is an encouraging indication that the library can
serve as a framework for developing more ambitious applications in the
next stages of this work.
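
As a rough illustration of the first two ideas (single instances stored in a tree, and listeners updated on change), here is a minimal Python sketch. It is not a port of the sc-hacks classes; the class names, method names and the `make` argument are assumptions made for this example.

```python
from collections import defaultdict


class Registry:
    """Store at most one instance per key path in a nested dict (a tree)."""

    def __init__(self):
        self._tree = {}

    def get(self, *path, make):
        """Return the instance stored at path, creating it once if absent."""
        *branch, leaf = path
        node = self._tree
        for key in branch:
            node = node.setdefault(key, {})
        if leaf not in node:
            node[leaf] = make()
        return node[leaf]


class Notifier:
    """Call registered listeners whenever a named item changes."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def listen(self, name, callback):
        self._listeners[name].append(callback)

    def changed(self, name, value):
        for callback in self._listeners[name]:
            callback(value)
```

Asking the registry twice for the same path, e.g. `registry.get("players", "lsm1", make=dict)`, returns the same object both times, which is the property that lets a player or bus be looked up by name from anywhere in the code.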

3. APPLICATIONS 

3.1. Telematic Dance 

Sc-hacks was first used in a telematic dance project whose goal is
to enable dancers to perform together concurrently in different cities
by sharing data from motion sensors sent via OSC over the internet
[16]. Sensors were constructed using LSM9DS0 motion sensor
modules and Feather Huzzah ESP8266 wifi modules from Adafruit, and
connected to Supercollider via the micro-osc package on MicroPython.
Several sessions with dancers in Tokyo, Athens and Corfu served
to experiment with different sound synthesis algorithms and to test
the usability of the interface and algorithms for dance improvisation.
The results were generally more encouraging than expected, except
in Corfu, where the dancers showed a more cerebral approach
emphasizing control over the sound result rather than free exploration
of the sonic landscape through movement.

A significant new turn in the development of the library was
prompted during the initial tests for remote collaboration performed
during a workshop organized at the University of Manchester by
Prof. Ricardo Climent for the EASTN-DC EU-Culture program.
This showed the need for distributing versions of the library to
different remote partners, using different custom settings for each partner.
Opening files in the Supercollider IDE in order to select and
execute appropriate code segments soon proved to be impractical
under the time pressure of preparing the test within a
large-scale workshop and the awkward time-zone difference between the
partners involved. Thus, a plug-and-play solution had to be devised,
or at least one that relied on selecting options from menus or lists
and clicking on buttons rather than opening files and executing code.
This gave rise to a new interface: a GUI for selecting and evaluating
snippets of code contained in files within subfolders of a global
"Snippets" folder (Figure 5). The scheme has since served for the
archival of experiments and performances, facilitating easy overview
and reuse of past code. It is furthermore integrated for use with EMACS as
primary IDE for Supercollider, with automatic updates of code
between EMACS and the Supercollider-based GUI.

Two further features were necessary for the experiments with
dancers. First, a GUI that displays OSC data as they are received,
and second a mechanism that scales and assigns incoming OSC data
to the desired parameters. The following code shows how to generate
a GUI that displays data changes for a set of named parameters.
Updates are displayed whenever a parameter is changed, independently
of the source of the change (i.e. automated algorithm, evaluation of
code, MIDI or OSC input).

\lsm1.v(
    \dur.slider([0.1, 12], \lsm1),
    \pos.slider([0.0, 1.0], \lsm1),
    \rate.slider([0.2, 15], \lsm1),
    \gps.slider([0.5, 20.0], \lsm1),
    \pan.slider([-1, 1.0], \lsm1),
    \amp.slider(\amp, \lsm1)
);

The GUI in Figure 6 was generated by the code above.

The following example shows how to scale data input from OSC
messages and to assign them to named parameters in a specified
environment 'lsm1'.

\dur <+.lsm1 '/gyroscope1'.osc(0, [-40, 40], [0.01, 12.5]);
\pos <+.lsm1 '/gyroscope1'.osc(1, [-20, 40], [0.0, 1.0]);
\rate <+.lsm1 '/gyroscope1'.osc(2, [-20, 40], [0.1, 15]);
\gps <+.lsm1 '/magnetometer1'.osc(0, [-1.0, 0.5], [0.2, 15]);
\pan <+.lsm1 '/magnetometer1'.osc(1, [-0.25, 0.25], [-1, 1]);
\amp <+.lsm1 '/magnetometer1'.osc(2, [-0.05, 0.25], \amp);
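
Each mapping above names an element of the incoming OSC message, an expected input range and a target parameter range. A plausible reading is a clipped linear mapping between the two ranges; the Python sketch below assumes exactly that. The function name `scale` is hypothetical, and the actual curve used by sc-hacks (in particular for named specs such as \amp) may differ.

```python
def scale(value, in_range, out_range):
    """Clip value to in_range, then map it linearly onto out_range."""
    in_lo, in_hi = in_range
    out_lo, out_hi = out_range
    clipped = min(max(value, in_lo), in_hi)
    position = (clipped - in_lo) / (in_hi - in_lo)  # 0.0 .. 1.0
    return out_lo + position * (out_hi - out_lo)


# Input and output ranges copied from the \dur and \pos lines above.
dur = scale(0.0, (-40, 40), (0.01, 12.5))   # mid-range gyroscope reading
pos = scale(55.0, (-20, 40), (0.0, 1.0))    # out-of-range reading is clipped
```

Clipping before scaling keeps a parameter inside its declared range even when a sensor briefly reports values outside the range observed during calibration.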

The above features are only the beginning. As experiments with
dancers have shown, other GUIs and coding schemes are needed to
facilitate adjustment of the responsiveness of the sensors and
adaptation of their sound control aspects during performance. In this
respect a considerable amount of work is still required.

3.2. Coding Electroacoustic Music Performances 

A second test scenario was provided through the collaboration with
Dan Weinstein, a concert cellist specializing in contemporary music
performance with good knowledge of contemporary audio tools in
Linux. Mr. Weinstein selected two pieces from the early repertory
of electroacoustic music scored for tape recorder: Jonathan Harvey’s
"Ricercare una melodia" and Karlheinz Stockhausen’s Solo 19. Both
pieces had to be coded in Supercollider and rehearsed within one
week during a residency of Mr. Weinstein in Corfu, leading to a
public performance of the pieces. The time constraints were critical
because the pieces were both complex and demanding in terms of
score interpretation, following and coordination. The Stockhausen
piece proved to be especially difficult as it is originally scored for 4
assistants in the electronic part, where each assistant is assigned control
of the recording, playback and feedback levels of two tape recording
channels with varying loop durations between sections, using two
potentiometers. To execute this with a single performer on the
computer, the slider actions as well as the loop duration changes had to
be automated according to the indications in the score. Even under
these circumstances, an ideally faithful performance was impossible,
because each of the 6 levels demanded constant adjustment
according to the actual level of the instrumental performer, and each
transition had to be timed manually to prevent abrupt noticeable changes.
Still, this proved to be a fruitful exercise in creating a user
interface and coding the entire score, consisting of 6 different realization
versions. It resulted in a compact coding scheme for durations of
prescribed length (see Figure 7 for the notation of the first version,
Formschema I, and Figure 8 for its translation into GUI and automated
performance). This notation mechanism can in the future be repurposed as
a type of beat sequencing notation similar to that found in ixilang or
TidalCycles (although the Cycle scheme of Tidal has other features
which go beyond the scope of the present discussion).

Figure 5: Snippet List GUI (lists of snippet folders, of files in the
selected folder and of snippets in the selected file, the code of the
selected snippet, and "Stop all", "Read Folders" and "Play Score" buttons).

Figure 6: Grain Control GUI.

4. CONCLUSIONS AND FUTURE WORK 

Sc-hacks is a general-purpose extension to Supercollider, and the
intensive use of several binary operators may raise doubts about its
legibility or the general validity of its design priorities. However,
stress-testing sc-hacks through collaborations with dancers and
instrumentalists has shown its strong potential to solve diverse and
demanding problems under time pressure, and furthermore has
provided indications of its scalability in terms of coding various
features. This indicates that it is a suitable platform for further work,
and it is hoped that it will serve as a tool for addressing questions
of machine listening in live performance as well as other advanced
topics.

*formschemaI {
    ^StockhausenSoloFormschema(
        [ // numperiods, duration per period
            [11, 6],
            [8, 14.2],
            [7, 19],
            [6, 25.3],
            [9, 10.6],
            [10, 8]
        ], thisMethod.name.asString
    )
    .loadPeriodStates(
        X_X_XXX_ | XXXXXXX_ | X X | X_X_ | XXX | X_X_X_X_
        _XX_XX_XXX I XXXXXXX_ I XXX_XXX I XX I XXXXX_XX_ I _x_x_x_x_
        Lxxxx_xxxx I _XXXXXXX I _X_XXX I _X_XX I _XX_XX_ I _xxxxx_x_
        _X_XX_XX I _XXXXXX_ I _XX_XX I X I _XXXXXXXX I X X X
        _XX_XX_XX_ I _XXX_XXX I XXX_XX_ I _XX_XX I X_XXX_XXX I _x_xx_x_xx
        _XX_XX_X I _X_XXX_X I X XXI _XX_ I XXXI x_xx_x_xxx
    );
}


Figure 7: Code for Formschema I of Stockhausen Solo 19. 
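
The data in Figure 7 pair a number of periods with a duration per period for each of six cycles, followed by rows of per-period states. The Python sketch below shows how such a notation could be expanded into timed on/off states. It rests on two assumptions that the text does not state: that durations are in seconds, and that 'X' marks an active period and '_' an inactive one. All function names are hypothetical.

```python
# (number of periods, duration per period) for the six cycles in Figure 7.
CYCLES = [(11, 6), (8, 14.2), (7, 19), (6, 25.3), (9, 10.6), (10, 8)]


def cycle_durations(cycles):
    """Total duration of each cycle (periods times duration per period)."""
    return [periods * duration for periods, duration in cycles]


def parse_states(pattern):
    """Turn a string such as 'X_X_' into booleans, one per period.

    Assumption: 'X' marks an active period and '_' an inactive one.
    """
    return [char == "X" for char in pattern]


def timed_states(pattern, duration_per_period):
    """Pair each period's start time with its on/off state."""
    return [(index * duration_per_period, state)
            for index, state in enumerate(parse_states(pattern))]
```

For example, `timed_states("X_X", 6)` yields start times 0, 6 and 12 with the states on, off, on, which is the kind of schedule an automated slider or loop-duration change could be driven from.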


Recording data received from sensors is a first priority in the project.
A first prototype has been implemented using the built-in archival
facilities of Supercollider. A second implementation is currently
under development, which will record data into multichannel audio
signal buffers, and employ an extra channel to record the time
interval between receipt of successive OSC messages. Based on this, and
using the existing graphic visualization facilities of Supercollider
for audio signals, a functionality similar to the MuBu tools from
IRCAM³ is envisaged. In collaboration with PhD students working on
Machine Learning, it is planned to use this for further research.
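
The idea of storing the interval between messages in an extra channel can be sketched as follows. This is a Python illustration under stated assumptions, not the Supercollider implementation: `SensorRecorder` and its methods are hypothetical, and a plain list of frames stands in for a multichannel audio buffer.

```python
import time


class SensorRecorder:
    """Append each message as one frame: sensor values plus a delta-time channel."""

    def __init__(self, num_channels, clock=time.monotonic):
        self.num_channels = num_channels
        self.clock = clock
        self.frames = []          # each frame has num_channels + 1 entries
        self._last_time = None

    def record(self, values):
        if len(values) != self.num_channels:
            raise ValueError("unexpected number of sensor values")
        now = self.clock()
        # The first frame has no predecessor, so its interval is zero.
        delta = 0.0 if self._last_time is None else now - self._last_time
        self._last_time = now
        self.frames.append(list(values) + [delta])

    def timestamps(self):
        """Rebuild times relative to the first frame from the stored deltas."""
        total, times = 0.0, []
        for frame in self.frames:
            total += frame[-1]
            times.append(total)
        return times
```

Storing intervals rather than absolute times keeps every channel in a small numeric range, and the original timing can still be recovered by a running sum, as `timestamps` does.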

In parallel, work is being done to connect data sent over the internet
in remote performances, and to develop a performance repertory
with instrumental soloists interested in improvisation with live
electronics. In both these cases, the most serious challenge consists in
making the software stable and easy enough to use that it can be
released to non-specialist performers for work in real-world creative
events without the need for specialized technical assistance to run it.
This remains a major driving factor and design guideline in
developing this software. At the same time it is expected that these
requirements will help create best-practice solutions that constitute the
wider contribution of this project. In this sense, the present project is
placed within the scope of efforts for developing contemporary
languages of notation for performance practice that have lasting impact
on the community and its aesthetics.

Figure 8: GUI for Formschema 1 of Stockhausen Solo 19.


ABSTRACT

Bipscript is a domain-specific scripting language designed to make 
it easier to create interactive music. The base language is the 
Squirrel scripting language which has been complemented with a 
standard class library containing audio-specific domain objects. 
This API provides methods for creating, scheduling and handling 
events of various types including MIDI, OSC and extracted 
features of audio streams. A single-threaded programming model 
with asynchronous event handling is familiar to web developers but 
atypical in music DSLs. Scripts are executed by a command line 
interpreter with tight integration to the audio system. 

1. INTRODUCTION AND DESIGN GOALS 

The Bipscript project began as an attempt to implement a musical
"bot" application, with the base functionality of existing
auto-accompaniment software augmented with a high degree of
interactive functionality. The goal was software that would output
appropriate MIDI sequences in real-time based on external inputs,
most notably data from human performers. The emphasis was on
tempo-driven music with tight integration to a local transport.

Design goals did not include specialization on any particular style 
of music, nor any specific assumptions on how external inputs 
would affect generation of MIDI sequences, leaving these 
decisions to the configuration of a particular piece. 

As the options for configuration grew it became clear the easiest 
way to express the behavior of a particular musical part would be 
via an imperative language with the abilities to directly receive 
relevant input from external sources, and use this information 
algorithmically to sequence MIDI notes. 

2. FEATURES 

2.1. Squirrel 

The scripting language itself is the Squirrel language [1]. From the 
Squirrel website: 

"Squirrel is a high level imperative, object-oriented programming 
language, designed to be a light-weight scripting language that fits 
in the size, memory bandwidth, and real-time requirements of 
applications like video games." 

These attributes and the associated predictability in run-time
behavior make Squirrel an ideal language also for the real-time demands
of audio applications.


Bipscript builds on top of Squirrel by adding a class library API 
containing audio domain-specific classes and a custom transport- 
aware interpreter that allows for event handling. 

2.2. Class Library 

The Bipscript class library API features objects representing
plugins, mixers, and system inputs and outputs of various types,
predominantly audio, MIDI and OSC [2]. These objects can be
connected programmatically to create complex networks of the
different protocols (see Figure 1).

Events of any applicable type can be scheduled to occur on any 
node in the network, in particular code can generate and output 
timed MIDI and OSC sequences. Event handlers can be registered 
to fire on particular events including features extracted from audio 
streams. 

Additional classes allow the use of textual specification of musical
score data in the scripts using ABC notation [3], Music Macro
Language [4] or a MIDI tablature format based on common drum
tablature.



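// MIDI input port that drives the synth
local midiInput = Midi.Input("myinput")

// synth plugin fed by the MIDI input
local synth = Lv2.Plugin("urn:example:a-synth")
synth.connectMidi(midiInput)

// reverb effect fed by the synth
local effect = Lv2.Plugin("urn:ardour:a-reverb")
effect.connect(synth)

// stereo output fed by the effect
local mainOutput = Audio.StereoOutput("main")
mainOutput.connect(effect)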


Figure 1: Creating Connections. 


2.3. Threading and Context Model 

Scripts, including event handlers, run in a single execution thread
with a single global context. Instructions in a script will be
executed sequentially until the main body of the script completes. At
this point any event handlers that were registered by the script will
execute as needed in the same thread and scope with direct access
to all variables defined in the main body of the script.

This programming model is analogous to traditional JavaScript
development in a web browser where the main body of the script
registers event handlers that are then executed in the same thread
within the same page context. In both cases execution is
single-threaded and non-blocking in favor of asynchronous event
handling.
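
The model described above can be illustrated with a few lines of Python. This is only a sketch of the general pattern (a main body that registers handlers, followed by sequential handling of queued events in the same thread); the names `on`, `emit` and `run_pending` are hypothetical and are not part of the Bipscript API.

```python
from collections import deque

handlers = {}                # event name -> list of callbacks
pending = deque()            # events waiting to be handled
state = {"notes_seen": 0}    # variable defined in the "main body"


def on(event_name, callback):
    """Register a callback for a named event."""
    handlers.setdefault(event_name, []).append(callback)


def emit(event_name, payload):
    """Queue an event; nothing runs until run_pending is called."""
    pending.append((event_name, payload))


def run_pending():
    """Run queued events one at a time, in the same thread as the main body."""
    while pending:
        event_name, payload = pending.popleft()
        for callback in handlers.get(event_name, []):
            callback(payload)


def count_note(note):
    # Direct access to main-body state; no locking is needed.
    state["notes_seen"] += 1


# Main body: register a handler, then finish.
on("note_on", count_note)

# Later, events arrive and are handled sequentially.
emit("note_on", 60)
emit("note_on", 64)
run_pending()
```

Because only one handler runs at a time, shared variables such as `state` never need synchronization, which is the main convenience of this model for script authors.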

Although many objects such as plugins will participate in the audio
client’s process thread, the script thread itself is separate from the
process thread and not subject to its programming limitations on
e.g. memory allocation and system I/O.

2.4. Transport-Aware Interpreter 

The command-line interpreter is tightly integrated with the system
transport, on Linux provided by the Jack Audio Connection Kit [5].
Scripts can request to be the transport master via the API but are
not required to do so when there is an external master present to
provide position information.

Whether or not the script itself is the transport master, the inter¬ 
preter will act as a sequencer for any events scheduled by the 
script, playing them in time with the transport and reacting to any 
arbitrary external transport position changes including looping. 

3. COMPARISON WITH OTHER PROJECTS 

3.1. Synthesis Languages 

There exist several DSLs for music creation with large and active
communities, for instance Pure Data [6] and Supercollider [7]
among others. These environments differ from each other, for
instance Supercollider is a traditional text-based programming
language while Pure Data is a visual language.

There is, however, a common emphasis on sound design with code
libraries of elements representing oscillators, filters and other
signal generation capabilities. Bipscript currently offers no such
objects in its standard class library, instead offering hosting
capabilities for third-party sound generation and effects plugins.

The emphasis instead is on timed music which has led to an API
built around handling events and just-in-time sequencing using data
structures representing e.g. MIDI notes and mutable patterns
(groups of notes). In contrast most music DSLs produce timed
music at a lower level by alternating between immediate
sound-generation instructions and some variation of a system “sleep”
command.

3.2. Other Open Source Projects 

Other comparisons can be drawn to some of the many projects
arising from the community of open source audio software on Linux
and elsewhere:

One of the most feature-rich open source audio applications is the
Ardour DAW [8], which in recent versions has a large number of
the C++ implementation classes exposed as Lua objects [9] giving
a scripting environment incorporating much of Ardour's MIDI and
DAW functionality. This differs from a more traditional script
language-plus-interpreter environment in that scripts are executed as
callbacks in application-specific contexts, each with their own
scope and applicable model objects.

Another project using Lua is the Moony Lv2 plugins [10]. Taking
advantage of Lua's real-time performance, scripts are run directly
in the process thread and thus allow manipulation of e.g. MIDI
messages as they pass through the plugin, but are bound by all the
standard real-time limitations of running in the process thread.

LuaJack is a Lua binding library for Jack [11]. Scripts written in
LuaJack and Bipscript have a visual similarity due to the similarity
between Lua and Squirrel and the fact that the LuaJack and Bipscript
APIs wrap some of the same objects, e.g. system ports.
However, as a language binding LuaJack does not include an
interpreter nor an object API beyond directly exposing the Jack client
API.

4. IMPLEMENTATION 

The command line executable that functions as an interpreter to
execute scripts was written in C++ and runs on the Linux operating
system with the Jack Audio Connection Kit as a run-time
dependency.

The standard Squirrel implementation is intended to be embedded 
and was used as the basis of the command line interpreter. 

The interpreter also acts as a standard audio client, opening and 
connecting system audio and MIDI input/output ports and hosting 
plugins as specified by the executing script. 

Scripts are loaded and run in a dedicated execution thread separate 
from the application’s audio process thread. Any event generated 
by the script is appended to an applicable lock-free queue that is 
consumed by the process thread. 

Script objects that hold scheduled events will participate in the
audio process thread to pull events from the queues and emit them at
the appropriate position in a running transport with sample-level
accuracy. Synchronization between the script and process threads
allows the script execution to properly respond to arbitrary
transport location changes.
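
To make the notion of sample-level accuracy concrete, the following Python sketch shows how events scheduled at transport frames can be turned into offsets inside each audio block. It is an illustration only: a heap is used instead of a lock-free queue, nothing here is real-time safe, and the policy of dropping events that a transport jump has left behind is an assumption of this sketch rather than documented Bipscript behavior. `EventQueue` and `beat_to_frame` are hypothetical names.

```python
import heapq


class EventQueue:
    """Hold events ordered by transport frame and hand them out block by block."""

    def __init__(self):
        self._heap = []
        self._count = 0   # tie-breaker keeps insertion order for equal frames

    def schedule(self, frame, event):
        heapq.heappush(self._heap, (frame, self._count, event))
        self._count += 1

    def process(self, block_start, nframes):
        """Return (offset, event) pairs that fall inside the current block."""
        block_end = block_start + nframes
        due = []
        # Events left behind by a transport jump are dropped, not played late.
        while self._heap and self._heap[0][0] < block_start:
            heapq.heappop(self._heap)
        while self._heap and self._heap[0][0] < block_end:
            frame, _, event = heapq.heappop(self._heap)
            due.append((frame - block_start, event))
        return due


def beat_to_frame(beat, bpm, sample_rate):
    """Convert a beat position to a frame index at a constant tempo."""
    return round(beat * 60.0 / bpm * sample_rate)
```

At 120 BPM and 48 kHz, beat 1 falls on frame 24000; a 1024-frame block starting at frame 23552 would therefore emit that event at offset 448 within the block.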

A set of bindings was created for the class library, generated from 
a high level API description to interface the C++ implementations 
of the standard library classes and methods to the Squirrel engine 
via its stack-based API. 

The object implementations make use of reliable third party code
where possible via both embedded code and dynamically linked
libraries. The current implementation makes use of popular
projects such as abcmidi, liblo and libsndfile.

5. USE CASES 

Now existing in a basic implementation, Bipscript can be used for 
the following use cases: 

5.1. Dynamic Accompaniment 

An example “Robot Jazz Band” was built [12], showing the script
implementation of 3 related bots (playing acoustic bass, piano and
trap drum samples) that play a dynamic sequence based on
rhythmic probabilities in a jazz “swing” pattern coupled with a given
input chord progression. All players take an input parameter of
“intensity”, playing louder and busier vs. softer and sparser on a
measure by measure basis. The main script connects an audio onset
detector to a system audio input and continuously updates this
intensity variable for all players based on the number of onsets received.

This example shows a standard programming model for creating a 
script with this kind of interactive behavior: 

• The main script instantiates plugins and needed audio and
MIDI connections (see Figure 1) and schedules any static
parts of the score

• Event handlers listen to human performers via e.g. MIDI,
OSC and/or audio features and update an internal state
(see Figure 2)

• Scheduled methods read this state and use it as input in
calculating a short output sequence (e.g. a beat or bar at a
time)

The result is a dynamic and reactive auto-accompaniment system
written in relatively few lines of code compared to general-purpose
programming or even other music DSLs. The simple example
works as expected in practice but leaves open many paths for future
development in creating more complex interactive scripts.
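
The three-step programming model listed above can be sketched in Python as follows. This is not the Robot Jazz Band code: the function names, the mapping from onset count to intensity and the probability and velocity formulas are all assumptions chosen to keep the example small.

```python
import random

state = {"onsets": 0, "intensity": 0.0}


def on_onset():
    """Event handler: count onsets detected on the audio input."""
    state["onsets"] += 1


def update_intensity(max_onsets=16):
    """Once per measure: turn the onset count into a 0..1 intensity."""
    state["intensity"] = min(state["onsets"] / max_onsets, 1.0)
    state["onsets"] = 0


def next_bar(rng, steps=8):
    """Scheduled method: build one bar of (step, velocity) pairs.

    Higher intensity makes a hit on each step more likely and louder.
    """
    intensity = state["intensity"]
    probability = 0.2 + 0.7 * intensity
    velocity = int(40 + 80 * intensity)
    return [(step, velocity) for step in range(steps)
            if rng.random() < probability]


# Eight detected onsets in a measure give an intensity of 0.5.
for _ in range(8):
    on_onset()
update_intensity()
bar = next_bar(random.Random())
```

The handler only records what it hears, and the scheduled method only reads the resulting state, so the two stay decoupled: the mapping from input to musical output can be changed without touching the event handling.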


local controlPower = 0

controlInput.onControl(function(cc, pos) {
    // react only to MIDI controller number 15
    if (cc.controller() == 15) {
        // scale the 0-127 controller value to 0-100
        controlPower = cc.value() * 100 / 127
    }
})


Figure 2: Sample Code from the Robot Jazz Band Demo. 

5.2. Utility Scripts for Live Performers 

In many traditional live musical projects there is no need for
dynamically generated sequences, especially those where human
performers are playing from a static score. In such an environment
there is still a use for certain computer-aided functions such as
triggering samples in time or adding a “click track” or other timed
audio cues not heard by the audience. These functions can be built in
custom scripts with relatively few lines of code.

5.3. Live Coding 

Another prospective use is that of live coding. Those functions of
the command line interpreter that allow for developer convenience
may also be useful in a live coding situation, e.g. the ability of the
script to be loaded and unloaded dynamically including while the
transport is running. This use case is as of yet mostly unexplored.

6. FUTURE WORK 

With the completion of a basic proof-of-concept interpreter and
class library the main focus of the project remains stabilizing the
standard API and improving the basic tool implementations,
especially with an eye to reliability in live settings. To this end recent
work has been done in the area of unit and functional testing for
testing scripts as well as the interpreter itself.

Much of the API design going forward should be based on
feedback from those who use these tools in a live production setting.


ABSTRACT

The paper presents SoundPrism¹, a real-time sound analysis server
and dashboard. SoundPrism collects real-time sound data from a
microphone or PCM and performs analysis which can be broadcast
in near real-time for use by external applications. A dashboard
interface visualizes the collected data in a matrix, allowing the more
useful streams to be easily identified. This is useful in applications
where mapping higher-level sound events to another form of output
is desired. The tool offers particularly interesting new possibilities for
musicians and audio-visual artists. The design, implementation, and
conclusions are discussed in this paper.

1. INTRODUCTION 

SoundPrism is a software project that aims to simplify and expand 
the ways that real-time signal analysis and music information 
retrieval can be used in creative applications. 

This project is intended for providing simpler access to near
real-time MIR data as it is captured. The author’s interest in this project
relates to music visualization, composition, and the creation of
audio-visual performances, installations, and experiences.
However, the author believes the technique has vast applications
across disciplines.

SoundPrism is made of three components: a real-time database, a
graphical dashboard, and a domain specific language for
composing data flows out of signal analysis and music information
retrieval algorithms.

SoundPrism was built using OpenFrameworks, a creative coding
toolkit for C++, and the Essentia Project, a collection of signal
analysis and music information retrieval algorithms. Both are
cross-platform. SoundPrism was developed on Linux, using the
Linux-rt kernel and JACK.

1.1. Domain Specific Language 

SoundPrism offers a domain specific language for composing
analysis chains out of Essentia’s algorithms. These chains are then
available to be visualized, stored, or broadcast by the graphical
dashboard. This allows new analysis chains to be designed without
having to delve into all the related scaffolding.

1.2. Graphical Dashboard 

As a graphical dashboard, SoundPrism serves as a tool for thought
by providing a simultaneous view of up to 16 different data
streams. This format amplifies the user’s ability to quickly discern
the usefulness of a certain analysis in a certain context.

The dashboard offers multiple visual styles for rendering the data,
for example line graphs, histograms, and 3D cascading waterfalls.
The 3D visualizations use historical data to offer context of time.
The dashboard allows the user to select the most useful algorithms
to either store their data, broadcast it to the network, or both.

1 https://mgs.nyc/projects/2018/SoundPrism



1.3. Data Server 

SoundPrism broadcasts selected streams so that external applications
are able to use them. This allows existing applications to take full
advantage of the real-time signal analysis and music information
retrieval capabilities offered by SoundPrism. Furthermore, the data
can be accessed remotely, which allows for decoupling of
applications from the audio’s physical location.

2. IMPLEMENTATION 

SoundPrism was built in C++ using OpenFrameworks for
graphical visualization and i/o, and the Essentia Project for
its signal processing and music information retrieval
capabilities. For broadcasting of data, SoundPrism uses
OpenSoundControl (OSC), which makes the data available for use
by any other OSC-enabled application.
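
For readers unfamiliar with the wire format, the Python sketch below builds a minimal OSC 1.0 message carrying 32-bit float arguments, using only the standard library. It illustrates the protocol, not SoundPrism's own code: the helper names are hypothetical and so is the address "/soundprism/mfcc", which is not taken from the SoundPrism documentation.

```python
import struct


def osc_string(text):
    """Encode an OSC string: ASCII bytes, null-terminated, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    padding = (-len(data)) % 4
    return data + b"\x00" * padding


def osc_message(address, values):
    """Build an OSC 1.0 message whose arguments are all 32-bit floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", value) for value in values)
    return osc_string(address) + osc_string(type_tags) + payload


# "/soundprism/mfcc" is a hypothetical address chosen for this example.
packet = osc_message("/soundprism/mfcc", [0.5, -1.0])
```

The resulting bytes can be sent as a single UDP datagram to any OSC-enabled receiver, which is what makes this kind of broadcast easy to consume from other applications.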

SoundPrism’s dashboard uses OpenGL and provides each cell of
the matrix with its own framebuffer object containing a unique
camera and solitary planar mesh. This allows the cameras for each
cell to be controlled manually or snapped to common orientations.
Data is streamed into these meshes through the use of textures which
are then sent into a vertex shader responsible for deforming the
meshes. This allows for fast simultaneous rendering of
up to 16 different graphs.

The Domain Specific Language is implemented as a series of
macros which parse the contents of the user’s DSL-coded files and
transpile them into the syntax used by Essentia.

2.1. Algorithms Provided By Essentia²

FFT, DCT, frame cutter, windowing, envelope, smoothing, 
low/high/band pass, band reject, DC removal, equal loudness, 
median, mean, variance, power means, raw and central moments, 
spread, kurtosis, skewness, flatness, duration, loudness, LARM, 
Leq, Vickers' loudness, zero-crossing-rate, log attack time and 
other signal envelope descriptors, Bark/Mel/ERB bands, MFCC, 
GFCC, LPC, spectral peaks, complexity, rolloff, contrast, HFC, 
inharmonicity and dissonance, Pitch salience function, predominant 
melody and pitch, HPCP (chroma) related features, chords, key and 
scale, tuning frequency, beat detection, BPM, onset detection, 
rhythm transform, beat loudness, danceability, dynamic complexity, 
audio segmentation, SVM classifier 

2.2. DSL Example: MFCC 

Mel-frequency cepstral coefficients (MFCCs) are particularly
useful. Here’s how an MFCC data flow would look using
SoundPrism’s DSL.

// File: mfcc.mdf

audio{source: PCM}
fc{algo: FrameCutter
   size: 2048}
win{algo: Window}
spec{algo: Spectrum}
mfcc{algo: MFCC}

audio -> fc -> win -> spec -> data:mfcc

The code above sets up a data flow that begins from the PCM, is 
cut into a frame of data, windowed, spectrum analyzed, and then 
MFCC is calculated. The output of the MFCC calculation is stored 
in a thread-safe database under the id of “mfcc”. By adding these 
files, custom data flows become available to be placed within the 
dashboard and the data outputs are available for broadcasting via 
OSC. 
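
The general shape of such a data flow (named stages applied in order, with the result stored under an id in a store that is safe to share between threads) can be sketched in Python. This is not how SoundPrism is implemented: `DataStore`, `parse_chain` and `run_chain` are hypothetical, the stage functions are toy stand-ins rather than Essentia algorithms, and the sketch simply treats the final `data:<id>` term as the storage key.

```python
import threading


class DataStore:
    """Dictionary guarded by a lock so several threads can read and write it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def parse_chain(text):
    """Split 'a -> b -> data:name' into (['a', 'b'], 'name')."""
    *stages, sink = [part.strip() for part in text.split("->")]
    if not sink.startswith("data:"):
        raise ValueError("chain must end in data:<id>")
    return stages, sink[len("data:"):]


def run_chain(text, source, algorithms, store):
    """Feed source through each named stage and store the result under the id."""
    stages, key = parse_chain(text)
    value = source
    for name in stages:
        value = algorithms[name](value)
    store.put(key, value)
    return value


# Toy stages standing in for real analysis algorithms.
algorithms = {
    "abs": lambda samples: [abs(sample) for sample in samples],
    "mean": lambda samples: sum(samples) / len(samples),
}
store = DataStore()
run_chain("abs -> mean -> data:level", [-1.0, 0.5, -0.5, 2.0], algorithms, store)
```

Keeping the chain description as plain text separate from the stage implementations is what lets a dashboard or broadcaster discover results by id without knowing how they were computed.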

3. CONCLUSIONS 

SoundPrism demonstrates the usefulness of a robust platform for
collecting, analyzing, and visualizing signal analysis and music
information retrieval data. By making the data available to external
applications, a tremendous number of potential applications are
enabled, of which the following are particularly interesting:


2 https://essentia.upf.edu/documentation/documentation.html 


3.1. Music Visualization 

By allowing for more intelligent understanding of the audio signal, 
more meaningful ways of relating audio and visual outputs can be 
created. This can be used as a basis for a more sophisticated 
platform for music visualizations. 

3.2. Machine Listening 

Tools like SoundPrism may prove useful as a component or
prototyping tool for the design and implementation of Machine
Listening systems. Machine Listening takes several techniques and
technologies, like signal analysis, music information retrieval, and
machine learning, and then fuses them with contemporary
knowledge of cognitive science in efforts to simulate the human
ear-brain system. Software like SoundPrism could prove to be a
useful tool for generating datasets for the training of neural
networks. It would also be useful as a platform for quickly
experimenting with chains of signal analysis.

3.3. Sound Events 

Given simplified access to the multiplicity of sound, users
are able to begin using sound in more intelligent ways.
This opens the possibility for radical new forms of creativity and
composition where sound features could be used as hooks caught
by an event handler.
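
One simple way to turn a continuous feature stream into such a hook is to fire a handler each time the value rises above a threshold. The Python sketch below assumes that approach; `FeatureHook`, the threshold of 0.8 and the loudness values are all made up for illustration.

```python
class FeatureHook:
    """Call a handler each time a feature value rises above a threshold."""

    def __init__(self, threshold, handler):
        self.threshold = threshold
        self.handler = handler
        self._above = False

    def update(self, value):
        above = value > self.threshold
        if above and not self._above:   # fire only on the upward crossing
            self.handler(value)
        self._above = above


events = []
hook = FeatureHook(threshold=0.8, handler=events.append)
for loudness in [0.1, 0.9, 0.95, 0.3, 0.85]:
    hook.update(loudness)
# events now holds the two values that crossed the threshold: 0.9 and 0.85
```

Firing only on the upward crossing means a sustained loud passage produces one event rather than one per analysis frame, which is usually what an event handler in a composition expects.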